[RFC PATCH v12 01/26] x86/tlb: add APIs manipulating tlb batch's arch data
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

A new mechanism, LUF (Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios cannot change while they stay in pcp or
buddy, so we can still read the data through the stale tlb entries.

This is a preparation for the mechanism, which needs to recognize
read-only tlb entries by splitting the tlb batch's arch data in two,
one for read-only entries and the other for writable ones, and merging
the two when needed.

It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed in the meantime. To support this, add
APIs for manipulating the tlb batch's arch data on x86.
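
For illustration only, a minimal sketch (not part of this patch) of how
these helpers are meant to compose; the caller and its three batches
below are hypothetical:

  /*
   * Hypothetical caller, for illustration only: merge a read-only
   * batch into a writable one, then skip the CPUs another batch has
   * already flushed.
   */
  static void example_merge_and_flush(struct arch_tlbflush_unmap_batch *ro,
                                      struct arch_tlbflush_unmap_batch *rw,
                                      struct arch_tlbflush_unmap_batch *flushed)
  {
          /* Accumulate the read-only batch's CPUs into the writable one. */
          arch_tlbbatch_fold(rw, ro);
          arch_tlbbatch_clear(ro);

          /*
           * arch_tlbbatch_done() drops the CPUs that 'flushed' already
           * covers and returns true if none are left, in which case no
           * shootdown is needed at all.
           */
          if (arch_tlbbatch_done(rw, flushed))
                  return;

          arch_tlbbatch_flush(rw);
          arch_tlbbatch_clear(rw);
  }

arch_tlbbatch_need_fold() plays the opposite role on the unmap side: it
reports whether an mm still runs on CPUs not yet covered by the batch.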
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/x86/include/asm/tlbflush.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..0ae9564c7301e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -5,6 +5,7 @@
#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
@@ -293,6 +294,29 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
--
2.17.1

[RFC PATCH v12 02/26] arm64/tlbflush: add APIs manipulating tlb batch's arch data
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

A new mechanism, LUF (Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios cannot change while they stay in pcp or
buddy, so we can still read the data through the stale tlb entries.

This is a preparation for the mechanism, which requires manipulating
the tlb batch's arch data. Even though arm64 has nothing to do for the
tlb batch itself, any arch with CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
should provide these APIs.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 95fbc8c056079..a62e1ea61e4af 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -354,6 +354,33 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ /* nothing to do */
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* nothing to do */
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return false;
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* Kernel can consider tlb batch always has been done. */
+ return true;
+}
+
/*
* This is meant to avoid soft lock-ups on large TLB flushing ranges and not
* necessarily a performance improvement.
--
2.17.1

[RFC PATCH v12 03/26] riscv/tlb: add APIs manipulating tlb batch's arch data
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

A new mechanism, LUF (Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios cannot change while they stay in pcp or
buddy, so we can still read the data through the stale tlb entries.

This is a preparation for the mechanism, which needs to recognize
read-only tlb entries by splitting the tlb batch's arch data in two,
one for read-only entries and the other for writable ones, and merging
the two when needed.

It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed in the meantime. To support this, add
APIs for manipulating the tlb batch's arch data on riscv.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 72e5599349529..1dc7d30273d59 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -8,6 +8,7 @@
#define _ASM_RISCV_TLBFLUSH_H
#include <linux/mm_types.h>
+#include <linux/cpumask.h>
#include <asm/smp.h>
#include <asm/errata_list.h>
@@ -65,6 +66,33 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
extern unsigned long tlb_flush_all_threshold;
#else /* CONFIG_MMU */
#define local_flush_tlb_all() do { } while (0)
--
2.17.1

[RFC PATCH v12 04/26] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush()
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

A new mechanism, LUF (Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios cannot change while they stay in pcp or
buddy, so we can still read the data through the stale tlb entries.

This is a preparation for the mechanism, which needs to avoid redundant
tlb flushes by manipulating the tlb batch's arch data. To achieve that,
separate the part that clears the tlb batch's arch data out of
arch_tlbbatch_flush().
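
For illustration, a minimal sketch (hypothetical caller, not part of
this patch) of what the separation buys: the cpumask now survives
arch_tlbbatch_flush(), so it can still be inspected or folded before
being cleared explicitly.

  /* Hypothetical caller, for illustration only. */
  static void example_flush_then_reuse(struct tlbflush_unmap_batch *ubc,
                                       struct arch_tlbflush_unmap_batch *done)
  {
          arch_tlbbatch_flush(&ubc->arch);

          /*
           * The cpumask is still intact here, e.g. record which CPUs
           * have just been flushed before resetting the batch.
           */
          arch_tlbbatch_fold(done, &ubc->arch);

          arch_tlbbatch_clear(&ubc->arch);
  }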
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/mm/tlbflush.c | 1 -
arch/x86/mm/tlb.c | 2 --
mm/rmap.c | 1 +
3 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9b6e86ce38674..36f996af6256c 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -201,5 +201,4 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
- cpumask_clear(&batch->cpumask);
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..860e49b223fd7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1262,8 +1262,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
- cpumask_clear(&batch->cpumask);
-
put_flush_tlb_info();
put_cpu();
}
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7e..2de01de164ef0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -648,6 +648,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
+ arch_tlbbatch_clear(&tlb_ubc->arch);
tlb_ubc->flush_required = false;
tlb_ubc->writable = false;
}
--
2.17.1

[RFC PATCH v12 05/26] mm/buddy: make room for a new variable, luf_key, in struct page
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which tracks the need of tlb flush for each page residing in pcp or
buddy.

Since the private field in struct page is used only to store the page
order while in buddy, ranging from 0 to MAX_PAGE_ORDER, it can be
covered by an unsigned short. So split it into two smaller fields,
order and luf_key, so that both can be used in buddy at the same time.
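
For illustration, a minimal sketch (not part of this patch; the values
are arbitrary) of how the two 16-bit fields coexist via the new
helpers, which alias the old private word:

  /* Hypothetical, for illustration only. */
  static void example_buddy_meta(struct page *page)
  {
          /* Both fields live in (part of) the old 'private' storage. */
          set_page_buddy_order(page, 3);
          set_page_luf_key(page, 42);

          /* Setting one must not clobber the other. */
          VM_WARN_ON(page_buddy_order(page) != 3);
          VM_WARN_ON(page_luf_key(page) != 42);
  }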
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 42 +++++++++++++++++++++++++++++++++-------
mm/internal.h | 4 ++--
mm/page_alloc.c | 2 +-
3 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80fef38d9d645..20d85c4e609de 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -106,13 +106,27 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
- /**
- * @private: Mapping-private opaque data.
- * Usually used for buffer_heads if PagePrivate.
- * Used for swp_entry_t if swapcache flag set.
- * Indicates order in the buddy system if PageBuddy.
- */
- unsigned long private;
+ union {
+ /**
+ * @private: Mapping-private opaque data.
+ * Usually used for buffer_heads if PagePrivate.
+ * Used for swp_entry_t if swapcache flag set.
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned long private;
+ struct {
+ /*
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned short order;
+
+ /*
+ * For tracking need of tlb flush,
+ * by luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+ };
+ };
};
struct { /* page_pool used by netstack */
/**
@@ -537,6 +551,20 @@ static inline void set_page_private(struct page *page, unsigned long private)
page->private = private;
}
+#define page_buddy_order(page) ((page)->order)
+
+static inline void set_page_buddy_order(struct page *page, unsigned int order)
+{
+ page->order = (unsigned short)order;
+}
+
+#define page_luf_key(page) ((page)->luf_key)
+
+static inline void set_page_luf_key(struct page *page, unsigned short luf_key)
+{
+ page->luf_key = luf_key;
+}
+
static inline void *folio_get_private(struct folio *folio)
{
return folio->private;
diff --git a/mm/internal.h b/mm/internal.h
index 5a7302baeed7c..754f1dd763448 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -541,7 +541,7 @@ struct alloc_context {
static inline unsigned int buddy_order(struct page *page)
{
/* PageBuddy() must be checked by the caller */
- return page_private(page);
+ return page_buddy_order(page);
}
/*
@@ -555,7 +555,7 @@ static inline unsigned int buddy_order(struct page *page)
* times, potentially observing different values in the tests and the actual
* use of the result.
*/
-#define buddy_order_unsafe(page) READ_ONCE(page_private(page))
+#define buddy_order_unsafe(page) READ_ONCE(page_buddy_order(page))
/*
* This function checks whether a page is free && is the buddy
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 839708353cb77..59c26f59db3d6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -576,7 +576,7 @@ void prep_compound_page(struct page *page, unsigned int order)
static inline void set_buddy_order(struct page *page, unsigned int order)
{
- set_page_private(page, order);
+ set_page_buddy_order(page, order);
__SetPageBuddy(page);
}
--
2.17.1

[RFC PATCH v12 06/26] mm: move should_skip_kasan_poison() to mm/internal.h
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which needs should_skip_kasan_poison() to be available through
mm/internal.h.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 47 -----------------------------------------------
2 files changed, 47 insertions(+), 47 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 754f1dd763448..e3084d32272e3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1038,8 +1038,55 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
DECLARE_STATIC_KEY_TRUE(deferred_pages);
bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
+
+static inline bool deferred_pages_enabled(void)
+{
+ return static_branch_unlikely(&deferred_pages);
+}
+#else
+static inline bool deferred_pages_enabled(void)
+{
+ return false;
+}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+/*
+ * Skip KASAN memory poisoning when either:
+ *
+ * 1. For generic KASAN: deferred memory initialization has not yet completed.
+ * Tag-based KASAN modes skip pages freed via deferred memory initialization
+ * using page tags instead (see below).
+ * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
+ * that error detection is disabled for accesses via the page address.
+ *
+ * Pages will have match-all tags in the following circumstances:
+ *
+ * 1. Pages are being initialized for the first time, including during deferred
+ * memory init; see the call to page_kasan_tag_reset in __init_single_page.
+ * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
+ * exception of pages unpoisoned by kasan_unpoison_vmalloc.
+ * 3. The allocation was excluded from being checked due to sampling,
+ * see the call to kasan_unpoison_pages.
+ *
+ * Poisoning pages during deferred memory init will greatly lengthen the
+ * process and cause problem in large memory systems as the deferred pages
+ * initialization is done with interrupt disabled.
+ *
+ * Assuming that there will be no reference to those newly initialized
+ * pages before they are ever allocated, this should have no effect on
+ * KASAN memory tracking as the poison will be properly inserted at page
+ * allocation time. The only corner case is when pages are allocated by
+ * on-demand allocation and then freed again before the deferred pages
+ * initialization is done, but this is not likely to happen.
+ */
+static inline bool should_skip_kasan_poison(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_KASAN_GENERIC))
+ return deferred_pages_enabled();
+
+ return page_kasan_tag(page) == KASAN_TAG_KERNEL;
+}
+
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59c26f59db3d6..244cb30496be5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -299,11 +299,6 @@ int page_group_by_mobility_disabled __read_mostly;
*/
DEFINE_STATIC_KEY_TRUE(deferred_pages);
-static inline bool deferred_pages_enabled(void)
-{
- return static_branch_unlikely(&deferred_pages);
-}
-
/*
* deferred_grow_zone() is __init, but it is called from
* get_page_from_freelist() during early boot until deferred_pages permanently
@@ -316,11 +311,6 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
return deferred_grow_zone(zone, order);
}
#else
-static inline bool deferred_pages_enabled(void)
-{
- return false;
-}
-
static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order)
{
return false;
@@ -993,43 +983,6 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
return ret;
}
-/*
- * Skip KASAN memory poisoning when either:
- *
- * 1. For generic KASAN: deferred memory initialization has not yet completed.
- * Tag-based KASAN modes skip pages freed via deferred memory initialization
- * using page tags instead (see below).
- * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
- * that error detection is disabled for accesses via the page address.
- *
- * Pages will have match-all tags in the following circumstances:
- *
- * 1. Pages are being initialized for the first time, including during deferred
- * memory init; see the call to page_kasan_tag_reset in __init_single_page.
- * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
- * exception of pages unpoisoned by kasan_unpoison_vmalloc.
- * 3. The allocation was excluded from being checked due to sampling,
- * see the call to kasan_unpoison_pages.
- *
- * Poisoning pages during deferred memory init will greatly lengthen the
- * process and cause problem in large memory systems as the deferred pages
- * initialization is done with interrupt disabled.
- *
- * Assuming that there will be no reference to those newly initialized
- * pages before they are ever allocated, this should have no effect on
- * KASAN memory tracking as the poison will be properly inserted at page
- * allocation time. The only corner case is when pages are allocated by
- * on-demand allocation and then freed again before the deferred pages
- * initialization is done, but this is not likely to happen.
- */
-static inline bool should_skip_kasan_poison(struct page *page)
-{
- if (IS_ENABLED(CONFIG_KASAN_GENERIC))
- return deferred_pages_enabled();
-
- return page_kasan_tag(page) == KASAN_TAG_KERNEL;
-}
-
static void kernel_init_pages(struct page *page, int numpages)
{
int i;
--
2.17.1

[RFC PATCH v12 07/26] mm: introduce luf_ugen to be used as a global timestamp
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which needs to evaluate the temporal order of events to determine
whether the required tlb flush has been done on each CPU.

To achieve that, this patch introduces a generation number, luf_ugen,
and a few APIs manipulating the number. It's worth noting that the
number is designed to wrap around, so care must be taken when using it.
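
For illustration, a small standalone sketch (userspace demo, not kernel
code; assumes a 64-bit unsigned long) of why the signed-difference
comparison in ugen_before() stays correct across wraparound:

  #include <stdio.h>

  /* Same comparison as the kernel helper. */
  static int ugen_before(unsigned long a, unsigned long b)
  {
          return (long)(a - b) < 0;
  }

  int main(void)
  {
          unsigned long a = (unsigned long)-2;  /* issued just before wraparound */
          unsigned long b = 3;                  /* issued just after wraparound */

          /* 'a' is older than 'b' even though a > b numerically. */
          printf("%d\n", ugen_before(a, b));    /* prints 1 */
          return 0;
  }

next_ugen() and prev_ugen() exist only to step over zero, which is
reserved as the invalid value.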
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
mm/rmap.c | 22 ++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa99..53a5f1cb21e0d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4161,4 +4161,38 @@ static inline int do_mseal(unsigned long start, size_t len_in, unsigned long fla
}
#endif
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+/*
+ * luf_ugen will start with 2 so that 1 can be regarded as a passed one.
+ */
+#define LUF_UGEN_INIT 2
+
+static inline bool ugen_before(unsigned long a, unsigned long b)
+{
+ /*
+ * Consider wraparound.
+ */
+ return (long)(a - b) < 0;
+}
+
+static inline unsigned long next_ugen(unsigned long ugen)
+{
+ if (ugen + 1)
+ return ugen + 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen + 2;
+}
+
+static inline unsigned long prev_ugen(unsigned long ugen)
+{
+ if (ugen - 1)
+ return ugen - 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen - 2;
+}
+#endif
#endif /* _LINUX_MM_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 2de01de164ef0..ed345503e4f88 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -634,6 +634,28 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
}
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+/*
+ * This generation number is primarily used as a global timestamp to
+ * determine whether tlb flush required has been done on each CPU. The
+ * function, ugen_before(), should be used to evaluate the temporal
+ * sequence of events because the number is designed to wraparound.
+ */
+static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+
+/*
+ * Don't return invalid luf_ugen, zero.
+ */
+static unsigned long __maybe_unused new_luf_ugen(void)
+{
+ unsigned long ugen = atomic_long_inc_return(&luf_ugen);
+
+ if (!ugen)
+ ugen = atomic_long_inc_return(&luf_ugen);
+
+ return ugen;
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1

[RFC PATCH v12 08/26] mm: introduce luf_batch to be used as hash table to store luf meta data
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which needs to keep luf meta data per page while the page stays in pcp
or buddy. The meta data includes a cpumask for tlb shootdown and luf's
request generation number.

Since struct page doesn't have enough room to store the luf meta data,
this patch introduces a hash table to store it and makes each page keep
its hash key instead.

Since all the pages in pcp or buddy share the hash table, collisions
are inevitable, so care must be taken when reading or updating an
entry.
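
For illustration, a minimal sketch (hypothetical caller, not part of
this patch) of how an entry is consumed; luf_key selects the shared
entry and the per-entry rwlock guards against concurrent folds into it:

  /* Hypothetical, for illustration only. */
  static void example_read_entry(unsigned short luf_key,
                                 struct tlbflush_unmap_batch *dst)
  {
          struct luf_batch *lb = &luf_batch[luf_key];
          unsigned long flags;

          read_lock_irqsave(&lb->lock, flags);
          /* Do not reset the shared entry, hence 'false'. */
          fold_batch(dst, &lb->batch, false);
          read_unlock_irqrestore(&lb->lock, flags);
  }

Note how fold_luf_batch() below orders the two locks by address before
nesting them, so two CPUs folding entries into each other cannot
deadlock.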
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 10 ++++
mm/internal.h | 8 +++
mm/rmap.c | 122 +++++++++++++++++++++++++++++++++++++--
3 files changed, 136 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 20d85c4e609de..39a6b5124b01f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -32,6 +32,16 @@
struct address_space;
struct mem_cgroup;
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+#else
+struct luf_batch {};
+#endif
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/mm/internal.h b/mm/internal.h
index e3084d32272e3..b38a9ae9d6993 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1240,6 +1240,8 @@ extern struct workqueue_struct *mm_percpu_wq;
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
static inline void try_to_unmap_flush(void)
{
@@ -1250,6 +1252,12 @@ static inline void try_to_unmap_flush_dirty(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
+{
+}
+static inline void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index ed345503e4f88..74fbf6c2fb3a7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -641,7 +641,7 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
* function, ugen_before(), should be used to evaluate the temporal
* sequence of events because the number is designed to wraparound.
*/
-static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
@@ -656,6 +656,122 @@ static unsigned long __maybe_unused new_luf_ugen(void)
return ugen;
}
+static void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+ arch_tlbbatch_clear(&batch->arch);
+ batch->flush_required = false;
+ batch->writable = false;
+}
+
+void fold_batch(struct tlbflush_unmap_batch *dst,
+ struct tlbflush_unmap_batch *src, bool reset)
+{
+ if (!src->flush_required)
+ return;
+
+ /*
+ * Fold src to dst.
+ */
+ arch_tlbbatch_fold(&dst->arch, &src->arch);
+ dst->writable = dst->writable || src->writable;
+ dst->flush_required = true;
+
+ if (!reset)
+ return;
+
+ /*
+ * Reset src.
+ */
+ reset_batch(src);
+}
+
+/*
+ * The range that luf_key covers, which is 'unsigned short' type.
+ */
+#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
+
+/*
+ * Use 0th entry as accumulated batch.
+ */
+static struct luf_batch luf_batch[NR_LUF_BATCH];
+
+static void luf_batch_init(struct luf_batch *lb)
+{
+ rwlock_init(&lb->lock);
+ reset_batch(&lb->batch);
+ lb->ugen = atomic_long_read(&luf_ugen) - 1;
+}
+
+static int __init luf_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_LUF_BATCH; i++)
+ luf_batch_init(&luf_batch[i]);
+
+ return 0;
+}
+early_initcall(luf_init);
+
+/*
+ * key to point an entry of the luf_batch array
+ *
+ * note: zero means invalid key
+ */
+static atomic_t luf_kgen = ATOMIC_INIT(1);
+
+/*
+ * Don't return invalid luf_key, zero.
+ */
+static unsigned short __maybe_unused new_luf_key(void)
+{
+ unsigned short luf_key = atomic_inc_return(&luf_kgen);
+
+ if (!luf_key)
+ luf_key = atomic_inc_return(&luf_kgen);
+
+ return luf_key;
+}
+
+static void __fold_luf_batch(struct luf_batch *dst_lb,
+ struct tlbflush_unmap_batch *src_batch,
+ unsigned long src_ugen)
+{
+ /*
+ * dst_lb->ugen represents one that requires tlb shootdown for
+ * it, that is, sort of request number. The newer it is, the
+ * more tlb shootdown might be needed to fulfill the newer
+ * request. Conservatively keep the newer one.
+ */
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ dst_lb->ugen = src_ugen;
+ fold_batch(&dst_lb->batch, src_batch, false);
+}
+
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+ unsigned long flags;
+
+ /*
+ * Exactly same. Nothing to fold.
+ */
+ if (dst == src)
+ return;
+
+ if (&src->lock < &dst->lock) {
+ read_lock_irqsave(&src->lock, flags);
+ write_lock(&dst->lock);
+ } else {
+ write_lock_irqsave(&dst->lock, flags);
+ read_lock(&src->lock);
+ }
+
+ __fold_luf_batch(dst, &src->batch, src->ugen);
+
+ write_unlock(&dst->lock);
+ read_unlock_irqrestore(&src->lock, flags);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -670,9 +786,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
- arch_tlbbatch_clear(&tlb_ubc->arch);
- tlb_ubc->flush_required = false;
- tlb_ubc->writable = false;
+ reset_batch(tlb_ubc);
}
/* Flush iff there are potentially writable TLB entries that can race with IO */
--
2.17.1

[RFC PATCH v12 09/26] mm: introduce API to perform tlb shootdown on exit from page allocator
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which performs the required tlb shootdown on exit from the page
allocator.

This patch introduces a new API, rather than reusing the existing
try_to_unmap_flush(), to avoid repeated and redundant tlb shootdowns
caused by frequent page allocations during a session of batched unmap
flush.
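
For illustration, a minimal sketch (hypothetical caller; the helper
that takes pages off is made up) of where the new API is meant to sit:

  /* Hypothetical, for illustration only. */
  static struct page *example_alloc_exit(void)
  {
          struct page *page;

          /* Assumed helper that fills current->tlb_ubc_takeoff. */
          page = example_take_pages_off();

          /*
           * Flush only what the pages just taken off require, instead
           * of everything accumulated in current->tlb_ubc so far.
           */
          try_to_unmap_flush_takeoff();

          return page;
  }

If the takeoff flush happens to cover every CPU pending in tlb_ubc,
tlb_ubc is reset as well, via arch_tlbbatch_done().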
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/internal.h | 4 ++++
mm/rmap.c | 20 ++++++++++++++++++++
3 files changed, 25 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd05..8e6e7a83332cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1375,6 +1375,7 @@ struct task_struct {
#endif
struct tlbflush_unmap_batch tlb_ubc;
+ struct tlbflush_unmap_batch tlb_ubc_takeoff;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index b38a9ae9d6993..cbdebf8a02437 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1239,6 +1239,7 @@ extern struct workqueue_struct *mm_percpu_wq;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
+void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
@@ -1249,6 +1250,9 @@ static inline void try_to_unmap_flush(void)
static inline void try_to_unmap_flush_dirty(void)
{
}
+static inline void try_to_unmap_flush_takeoff(void)
+{
+}
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 74fbf6c2fb3a7..72c5e665e59a4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -772,6 +772,26 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+void try_to_unmap_flush_takeoff(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
+
+ if (!tlb_ubc_takeoff->flush_required)
+ return;
+
+ arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+
+ /*
+ * Now that tlb shootdown of tlb_ubc_takeoff has been performed,
+ * it's good chance to shrink tlb_ubc if possible.
+ */
+ if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc);
+
+ reset_batch(tlb_ubc_takeoff);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1

[RFC PATCH v12 10/26] mm: introduce APIs to check if the page allocation is tlb shootdownable
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which has to identify whether tlb shootdown can be performed on page
allocation.

In a context with irq disabled, or in a non-task context, tlb shootdown
cannot be performed because of deadlock issues. Thus, the page
allocator should work being aware of whether tlb shootdown can be
performed on the page being returned.

This patch introduces APIs that the pcp or buddy page allocator can use
to delimit the critical sections that take off pages and to identify
whether tlb shootdown can be performed.
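
For illustration, a minimal sketch (hypothetical caller; the freelist
helpers are made up) of how the delimiters are meant to wrap a take-off
section:

  /* Hypothetical, for illustration only. */
  static struct page *example_take_off(struct zone *zone)
  {
          struct page *page;

          /* Delimit the section; tells whether shootdown is allowed. */
          luf_takeoff_start();

          spin_lock_irq(&zone->lock);
          page = example_peek_freelist(zone);             /* assumed helper */
          if (page && luf_takeoff_check_and_fold(page))
                  example_del_from_freelist(zone, page);  /* assumed helper */
          else
                  page = NULL;    /* luf'd page but no shootdown possible here */
          spin_unlock_irq(&zone->lock);

          /* Performs the shootdown gathered in tlb_ubc_takeoff, if allowed. */
          luf_takeoff_end();

          return page;
  }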
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 5 ++
mm/internal.h | 14 ++++
mm/page_alloc.c | 159 ++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 2 +-
4 files changed, 179 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8e6e7a83332cf..c4ff83e1d5953 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1374,6 +1374,11 @@ struct task_struct {
struct callback_head cid_work;
#endif
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ int luf_no_shootdown;
+ int luf_takeoff_started;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
diff --git a/mm/internal.h b/mm/internal.h
index cbdebf8a02437..55bc8ca0d6118 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1583,6 +1583,20 @@ static inline void accept_page(struct page *page)
{
}
#endif /* CONFIG_UNACCEPTED_MEMORY */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+extern struct luf_batch luf_batch[];
+bool luf_takeoff_start(void);
+void luf_takeoff_end(void);
+bool luf_takeoff_no_shootdown(void);
+bool luf_takeoff_check(struct page *page);
+bool luf_takeoff_check_and_fold(struct page *page);
+#else
+static inline bool luf_takeoff_start(void) { return false; }
+static inline void luf_takeoff_end(void) {}
+static inline bool luf_takeoff_no_shootdown(void) { return true; }
+static inline bool luf_takeoff_check(struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+#endif
/* pagewalk.c */
int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 244cb30496be5..cac2c95ca2430 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,6 +622,165 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+static bool no_shootdown_context(void)
+{
+ /*
+ * If it performs with irq disabled, that might cause a deadlock.
+ * Avoid tlb shootdown in this case.
+ */
+ return !(!irqs_disabled() && in_task());
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_start(void)
+{
+ unsigned long flags;
+ bool no_shootdown = no_shootdown_context();
+
+ local_irq_save(flags);
+
+ /*
+ * It's the outmost luf_takeoff_start().
+ */
+ if (!current->luf_takeoff_started)
+ VM_WARN_ON(current->luf_no_shootdown);
+
+ /*
+ * current->luf_no_shootdown > 0 doesn't mean tlb shootdown is
+ * not allowed at all. However, it guarantees tlb shootdown is
+ * possible once current->luf_no_shootdown == 0. It might look
+ * too conservative but for now do it this way for simplicity.
+ */
+ if (no_shootdown || current->luf_no_shootdown)
+ current->luf_no_shootdown++;
+
+ current->luf_takeoff_started++;
+ local_irq_restore(flags);
+
+ return !no_shootdown;
+}
+
+/*
+ * Should be called within the same context of luf_takeoff_start().
+ */
+void luf_takeoff_end(void)
+{
+ unsigned long flags;
+ bool no_shootdown;
+ bool outmost = false;
+
+ local_irq_save(flags);
+ VM_WARN_ON(!current->luf_takeoff_started);
+
+ /*
+ * Assume the context and irq flags are same as those at
+ * luf_takeoff_start().
+ */
+ if (current->luf_no_shootdown)
+ current->luf_no_shootdown--;
+
+ no_shootdown = !!current->luf_no_shootdown;
+
+ current->luf_takeoff_started--;
+
+ /*
+ * It's the outmost luf_takeoff_end().
+ */
+ if (!current->luf_takeoff_started)
+ outmost = true;
+
+ local_irq_restore(flags);
+
+ if (no_shootdown)
+ goto out;
+
+ try_to_unmap_flush_takeoff();
+out:
+ if (outmost)
+ VM_WARN_ON(current->luf_no_shootdown);
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_no_shootdown(void)
+{
+ bool no_shootdown = true;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ goto out;
+ }
+ no_shootdown = current->luf_no_shootdown;
+out:
+ local_irq_restore(flags);
+ return no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check(struct page *page)
+{
+ unsigned short luf_key = page_luf_key(page);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ return !current->luf_no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check_and_fold(struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+}
+#endif
+
static inline void account_freepages(struct zone *zone, int nr_pages,
int migratetype)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index 72c5e665e59a4..1581b1a00f974 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -693,7 +693,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
/*
* Use 0th entry as accumulated batch.
*/
-static struct luf_batch luf_batch[NR_LUF_BATCH];
+struct luf_batch luf_batch[NR_LUF_BATCH];
static void luf_batch_init(struct luf_batch *lb)
{
--
2.17.1

[RFC PATCH v12 11/26] mm: deliver luf_key to pcp or buddy on free after unmapping
From: Byungchul Park @ 2025-02-20 5:20 UTC
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo

Functionally, no change. This is a preparation for the luf mechanism,
which needs to pass a luf_key to pcp or the buddy allocator when pages
are freed after unmapping, e.g. during page reclaim or page migration.

The luf_key will be used to track, per page residing in pcp or buddy,
whether tlb shootdown is needed and which CPUs need to perform tlb
flush, and it should be handed over properly when pages travel between
pcp and buddy.
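
For illustration, a minimal sketch (hypothetical caller, not part of
this patch) of the new argument on the free path; a zero luf_key keeps
today's behaviour:

  /* Hypothetical, for illustration only. */
  static void example_free_after_unmap(struct page *page,
                                       unsigned short luf_key)
  {
          /*
           * Hand the order-0 page back together with the key of the
           * luf_batch entry recording its pending tlb shootdown. A
           * non-zero luf_key is only valid for order-0 frees.
           */
          free_unref_page(page, 0, luf_key);
  }

When two luf'd buddies merge in __free_one_page(), their keys are
combined with fold_luf_batch() so that the surviving key covers both
pending flushes.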
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 4 +-
mm/page_alloc.c | 120 ++++++++++++++++++++++++++++++++------------
mm/page_isolation.c | 6 +++
mm/page_reporting.c | 6 +++
mm/swap.c | 4 +-
mm/vmscan.c | 8 +--
6 files changed, 109 insertions(+), 39 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 55bc8ca0d6118..2bb54bc04260b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -741,8 +741,8 @@ extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
-void free_unref_page(struct page *page, unsigned int order);
-void free_unref_folios(struct folio_batch *fbatch);
+void free_unref_page(struct page *page, unsigned int order, unsigned short luf_key);
+void free_unref_folios(struct folio_batch *fbatch, unsigned short luf_key);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cac2c95ca2430..05a1098f8c61f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,7 +212,7 @@ unsigned int pageblock_order __read_mostly;
#endif
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags);
+ fpi_t fpi_flags, unsigned short luf_key);
/*
* results with 256, 32 in the lowmem_reserve sysctl:
@@ -850,8 +850,13 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
list_del(&page->buddy_list);
__ClearPageBuddy(page);
- set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ /*
+ * Keep head page's private until post_alloc_hook().
+ *
+ * XXX: Tail pages' private doesn't get cleared.
+ */
}
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -920,7 +925,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
- int migratetype, fpi_t fpi_flags)
+ int migratetype, fpi_t fpi_flags, unsigned short luf_key)
{
struct capture_control *capc = task_capc(zone);
unsigned long buddy_pfn = 0;
@@ -937,10 +942,21 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
+ /*
+ * Use the page's luf_key unchanged if luf_key == 0. Worth
+ * noting that page_luf_key() will be 0 in most cases since it's
+ * initialized at free_pages_prepare().
+ */
+ if (luf_key)
+ set_page_luf_key(page, luf_key);
+ else
+ luf_key = page_luf_key(page);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
+ unsigned short buddy_luf_key;
- if (compaction_capture(capc, page, order, migratetype)) {
+ if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -973,6 +989,18 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ /*
+ * !buddy_luf_key && !luf_key : do nothing
+ * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
+ * !buddy_luf_key && luf_key : do nothing
+ * buddy_luf_key && luf_key : merge two into luf_key
+ */
+ buddy_luf_key = page_luf_key(buddy);
+ if (buddy_luf_key && !luf_key)
+ luf_key = buddy_luf_key;
+ else if (buddy_luf_key && luf_key)
+ fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+
if (unlikely(buddy_mt != migratetype)) {
/*
* Match buddy type. This ensures that an
@@ -984,6 +1012,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
+ set_page_luf_key(page, luf_key);
pfn = combined_pfn;
order++;
}
@@ -1164,6 +1193,11 @@ __always_inline bool free_pages_prepare(struct page *page,
VM_BUG_ON_PAGE(PageTail(page), page);
+ /*
+ * Ensure private is zero before using it inside allocator.
+ */
+ set_page_private(page, 0);
+
trace_mm_page_free(page, order);
kmsan_free_page(page, order);
@@ -1329,7 +1363,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
}
@@ -1353,7 +1388,7 @@ static void split_large_buddy(struct zone *zone, struct page *page,
while (pfn != end) {
int mt = get_pfnblock_migratetype(page, pfn);
- __free_one_page(page, pfn, zone, order, mt, fpi);
+ __free_one_page(page, pfn, zone, order, mt, fpi, 0);
pfn += 1 << order;
page = pfn_to_page(pfn);
}
@@ -1361,11 +1396,18 @@ static void split_large_buddy(struct zone *zone, struct page *page,
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long flags;
spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * valid luf_key can be passed only if order == 0.
+ */
+ VM_WARN_ON(luf_key && order);
+ set_page_luf_key(page, luf_key);
+
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1373,13 +1415,13 @@ static void free_one_page(struct zone *zone, struct page *page,
}
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
if (free_pages_prepare(page, order))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ free_one_page(zone, page, pfn, order, fpi_flags, luf_key);
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -1433,7 +1475,7 @@ void __meminit __free_pages_core(struct page *page, unsigned int order,
* Bypass PCP and place fresh pages right to the tail, primarily
* relevant for memory onlining.
*/
- __free_pages_ok(page, order, FPI_TO_TAIL);
+ __free_pages_ok(page, order, FPI_TO_TAIL, 0);
}
/*
@@ -2459,6 +2501,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (unlikely(page == NULL))
break;
+ /*
+ * Keep the page's luf_key.
+ */
+
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -2740,12 +2786,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
struct page *page, int migratetype,
- unsigned int order)
+ unsigned int order, unsigned short luf_key)
{
int high, batch;
int pindex;
bool free_high = false;
+ set_page_luf_key(page, luf_key);
+
/*
* On freeing, reduce the number of pages that are batch allocated.
* See nr_pcp_alloc() where alloc_factor is increased for subsequent
@@ -2754,7 +2802,16 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
pcp->alloc_factor >>= 1;
__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
- list_add(&page->pcp_list, &pcp->lists[pindex]);
+
+ /*
+ * Defer tlb shootdown as much as possible by putting luf'd
+ * pages to the tail.
+ */
+ if (luf_key)
+ list_add_tail(&page->pcp_list, &pcp->lists[pindex]);
+ else
+ list_add(&page->pcp_list, &pcp->lists[pindex]);
+
pcp->count += 1 << order;
batch = READ_ONCE(pcp->batch);
@@ -2789,7 +2846,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
/*
* Free a pcp page
*/
-void free_unref_page(struct page *page, unsigned int order)
+void free_unref_page(struct page *page, unsigned int order,
+ unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
@@ -2798,7 +2856,7 @@ void free_unref_page(struct page *page, unsigned int order)
int migratetype;
if (!pcp_allowed_order(order)) {
- __free_pages_ok(page, order, FPI_NONE);
+ __free_pages_ok(page, order, FPI_NONE, luf_key);
return;
}
@@ -2815,7 +2873,7 @@ void free_unref_page(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+ free_one_page(page_zone(page), page, pfn, order, FPI_NONE, luf_key);
return;
}
migratetype = MIGRATE_MOVABLE;
@@ -2825,10 +2883,10 @@ void free_unref_page(struct page *page, unsigned int order)
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_unref_page_commit(zone, pcp, page, migratetype, order);
+ free_unref_page_commit(zone, pcp, page, migratetype, order, luf_key);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE, luf_key);
}
pcp_trylock_finish(UP_flags);
}
@@ -2836,7 +2894,7 @@ void free_unref_page(struct page *page, unsigned int order)
/*
* Free a batch of folios
*/
-void free_unref_folios(struct folio_batch *folios)
+void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp = NULL;
@@ -2857,7 +2915,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (!pcp_allowed_order(order)) {
free_one_page(folio_zone(folio), &folio->page,
- pfn, order, FPI_NONE);
+ pfn, order, FPI_NONE, luf_key);
continue;
}
folio->private = (void *)(unsigned long)order;
@@ -2893,7 +2951,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (is_migrate_isolate(migratetype)) {
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
@@ -2906,7 +2964,7 @@ void free_unref_folios(struct folio_batch *folios)
if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
locked_zone = zone;
@@ -2921,7 +2979,7 @@ void free_unref_folios(struct folio_batch *folios)
trace_mm_page_free_batched(&folio->page);
free_unref_page_commit(zone, pcp, &folio->page, migratetype,
- order);
+ order, luf_key);
}
if (pcp) {
@@ -3013,7 +3071,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
/* Return isolated page to tail of freelist. */
__free_one_page(page, page_to_pfn(page), zone, order, mt,
- FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
+ FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL, 0);
}
/*
@@ -4983,11 +5041,11 @@ void __free_pages(struct page *page, unsigned int order)
struct alloc_tag *tag = pgalloc_tag_get(page);
if (put_page_testzero(page))
- free_unref_page(page, order);
+ free_unref_page(page, order, 0);
else if (!head) {
pgalloc_tag_sub_pages(tag, (1 << order) - 1);
while (order-- > 0)
- free_unref_page(page + (1 << order), order);
+ free_unref_page(page + (1 << order), order, 0);
}
}
EXPORT_SYMBOL(__free_pages);
@@ -5049,7 +5107,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
if (page_ref_sub_and_test(page, count))
- free_unref_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page), 0);
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -5090,7 +5148,7 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
goto refill;
if (unlikely(nc->pfmemalloc)) {
- free_unref_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page), 0);
goto refill;
}
@@ -5134,7 +5192,7 @@ void page_frag_free(void *addr)
struct page *page = virt_to_head_page(addr);
if (unlikely(put_page_testzero(page)))
- free_unref_page(page, compound_order(page));
+ free_unref_page(page, compound_order(page), 0);
}
EXPORT_SYMBOL(page_frag_free);
@@ -5154,7 +5212,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
last = page + (1UL << order);
for (page += nr; page < last; page++)
- __free_pages_ok(page, 0, FPI_TO_TAIL);
+ __free_pages_ok(page, 0, FPI_TO_TAIL, 0);
}
return (void *)addr;
}
@@ -7124,7 +7182,7 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
- __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+ __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
}
@@ -7193,7 +7251,7 @@ static void __accept_page(struct zone *zone, unsigned long *flags,
accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER);
- __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL);
+ __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL, 0);
if (last)
static_branch_dec(&zones_with_unaccepted_pages);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 7e04047977cfe..8467838d4dbc8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -258,6 +258,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
+
+ /*
+ * Do not clear the page's private to keep its luf_key
+ * unchanged.
+ */
+
__putback_isolated_page(page, order, migratetype);
}
zone->nr_isolate_pageblock--;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e61d8c1..c05afb7a395f1 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -116,6 +116,12 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
int mt = get_pageblock_migratetype(page);
unsigned int order = get_order(sg->length);
+ /*
+ * Ensure private is zero before putting into the
+ * allocator.
+ */
+ set_page_private(page, 0);
+
__putback_isolated_page(page, order, mt);
/* If the pages were not reported due to error skip flagging */
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa17..54b0ba10dbb86 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -109,7 +109,7 @@ void __folio_put(struct folio *folio)
page_cache_release(folio);
folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
- free_unref_page(&folio->page, folio_order(folio));
+ free_unref_page(&folio->page, folio_order(folio), 0);
}
EXPORT_SYMBOL(__folio_put);
@@ -959,7 +959,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
folios->nr = j;
mem_cgroup_uncharge_folios(folios);
- free_unref_folios(folios);
+ free_unref_folios(folios, 0);
}
EXPORT_SYMBOL(folios_put_refs);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e38..2970a8f35d3d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1480,7 +1480,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
}
continue;
@@ -1548,7 +1548,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
@@ -1868,7 +1868,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
@@ -1890,7 +1890,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (free_folios.nr) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 12/26] mm: delimit critical sections to take off pages from pcp or buddy allocator
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (10 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 11/26] mm: deliver luf_key to pcp or buddy on free after unmapping Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 13/26] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
` (15 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Now that the luf mechanism has been introduced, tlb shootdown might be
necessary when luf'd pages leave pcp or the buddy allocator. Delimit the
critical sections where pages are taken off, check whether the current
context may take luf'd pages off, and perform the deferred tlb shootdown
for luf'd pages before they are used.
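To illustrate the pattern this patch applies to those paths, here is a
minimal sketch; take_off_one_page() is a hypothetical caller used only
for illustration, while the luf_takeoff_*() helpers and list accessors
are the ones used in the diff below.

static struct page *take_off_one_page(struct zone *zone, unsigned int order,
				      int migratetype)
{
	struct free_area *area = &zone->free_area[order];
	unsigned long flags;
	struct page *page;

	/* Open the section that tracks luf'd pages being taken off. */
	luf_takeoff_start();
	spin_lock_irqsave(&zone->lock, flags);

	/* Returns a luf'd page only if this context can shoot down. */
	page = get_page_from_free_area(area, migratetype);
	if (page) {
		del_page_from_free_list(page, zone, order, migratetype);
		/* Fold the page's pending shootdown, if any, into this context. */
		if (unlikely(!luf_takeoff_check_and_fold(page)))
			VM_WARN_ON(1);
	}

	spin_unlock_irqrestore(&zone->lock, flags);

	/* Check and flush before the caller actually uses the page. */
	luf_takeoff_end();
	return page;
}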
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 32 ++++++++++++++++--
mm/internal.h | 2 +-
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++--
mm/page_isolation.c | 4 ++-
mm/page_reporting.c | 20 +++++++++++-
5 files changed, 129 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 6009f5d1021a6..90f5c34f333db 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -605,6 +605,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
+ luf_takeoff_start();
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -652,9 +653,12 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
+ if (!luf_takeoff_check(page))
+ goto isolate_fail;
+
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, false);
if (!isolated)
break;
set_page_private(page, order);
@@ -682,6 +686,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/*
* Be careful to not go outside of the pageblock.
*/
@@ -1589,6 +1598,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
+ luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
@@ -1596,6 +1606,10 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
+
+ if (!luf_takeoff_check(freepage))
+ goto scan_next;
+
pfn = page_to_pfn(freepage);
if (pfn >= highest)
@@ -1615,7 +1629,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
-
+scan_next:
if (order_scanned >= limit)
break;
}
@@ -1633,7 +1647,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, false)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -1650,6 +1664,11 @@ static void fast_isolate_freepages(struct compact_control *cc)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
break;
@@ -2369,7 +2388,14 @@ static enum compact_result compact_finished(struct compact_control *cc)
{
int ret;
+ /*
+ * luf_takeoff_{start,end}() is required to identify whether
+ * this compaction context is tlb shootdownable for luf'd pages.
+ */
+ luf_takeoff_start();
ret = __compact_finished(cc);
+ luf_takeoff_end();
+
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index 2bb54bc04260b..3a6da77d04ed3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,7 +662,7 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order, bool willputback);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05a1098f8c61f..f2ea69596ff15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -869,8 +869,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
static inline struct page *get_page_from_free_area(struct free_area *area,
int migratetype)
{
- return list_first_entry_or_null(&area->free_list[migratetype],
+ struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ return NULL;
}
/*
@@ -1579,6 +1584,8 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -1950,6 +1957,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
return true;
@@ -1961,6 +1975,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, order,
get_pfnblock_migratetype(page, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, page, pfn, order, FPI_NONE);
return true;
@@ -2085,6 +2106,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
account_freepages(zone, nr_added, start_type);
@@ -2165,6 +2188,9 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
+ if (luf_takeoff_no_shootdown())
+ continue;
+
if (can_steal_fallback(order, migratetype))
*can_steal = true;
@@ -2256,6 +2282,11 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
pageblock_nr_pages)
continue;
+ /*
+ * luf_takeoff_{start,end}() is required for
+ * get_page_from_free_area() to use luf_takeoff_check().
+ */
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2313,10 +2344,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
return false;
@@ -2494,6 +2527,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2518,6 +2552,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add_tail(&page->pcp_list, list);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return i;
}
@@ -3012,7 +3050,7 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -3031,6 +3069,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
del_page_from_free_list(page, zone, order, mt);
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -3110,6 +3150,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3127,10 +3168,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return NULL;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3214,6 +3260,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
page = list_first_entry(list, struct page, pcp_list);
+ if (!luf_takeoff_check_and_fold(page))
+ return NULL;
list_del(&page->pcp_list);
pcp->count -= 1 << order;
} while (check_new_pages(page, order));
@@ -3231,11 +3279,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
return NULL;
}
@@ -3249,6 +3299,10 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -4853,6 +4907,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -4891,6 +4946,10 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -4900,6 +4959,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -7036,6 +7096,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -7064,9 +7125,15 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return end_pfn - start_pfn - already_offline;
}
@@ -7142,6 +7209,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
@@ -7154,6 +7222,8 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
@@ -7164,6 +7234,11 @@ bool take_page_off_buddy(struct page *page)
break;
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return ret;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 8467838d4dbc8..eae33d188762b 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -211,6 +211,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
@@ -229,7 +230,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order, true);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
@@ -269,6 +270,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
static inline struct page *
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index c05afb7a395f1..03a7f5f6dc073 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -167,6 +167,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (list_empty(list))
return err;
+ luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -191,6 +192,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
+ if (!luf_takeoff_check(page)) {
+ VM_WARN_ON(1);
+ continue;
+ }
+
/*
* If we fully consumed our budget then update our
* state to indicate that we are requesting additional
@@ -204,7 +210,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, false)) {
next = page;
break;
}
@@ -227,6 +233,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -236,6 +247,8 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
+ luf_takeoff_start();
+
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
@@ -259,6 +272,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
return err;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 13/26] mm: introduce pend_list in struct free_area to track luf'd pages
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (11 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 12/26] mm: delimit critical sections to take off pages from pcp or buddy allocator Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 14/26] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
` (14 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
luf'd pages require tlb shootdown when they exit the page allocator. For
some page allocation requests, it's okay to return a luf'd page followed
by a tlb shootdown, but it's not okay in e.g. irq context.
This patch splits the list in struct free_area into two: 'free_list' for
non-luf'd pages and 'pend_list' for luf'd pages, so that the buddy
allocator can serve the various calling contexts appropriately.
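As a condensed sketch of the allocation-side decision this enables (the
exact fallback order is in get_page_from_free_area() in the diff below;
pick_free_page() is an illustrative name only):

static struct page *pick_free_page(struct zone *zone, struct free_area *area,
				   int mt)
{
	/* Prefer pend_list once non-luf pages are getting scarce. */
	bool pend_first = !non_luf_pages_ok(zone);
	struct list_head *first = pend_first ? &area->pend_list[mt]
					     : &area->free_list[mt];
	struct list_head *second = pend_first ? &area->free_list[mt]
					      : &area->pend_list[mt];
	struct page *page;

	page = list_first_entry_or_null(first, struct page, buddy_list);
	if (!page)
		page = list_first_entry_or_null(second, struct page, buddy_list);

	/* A luf'd page may only leave if its shootdown can be performed. */
	if (page && !luf_takeoff_check(page))
		return NULL;
	return page;
}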
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mmzone.h | 3 ++
kernel/power/snapshot.c | 14 ++++++
kernel/vmcore_info.c | 2 +
mm/compaction.c | 33 ++++++++++---
mm/internal.h | 17 ++++++-
mm/mm_init.c | 2 +
mm/page_alloc.c | 105 ++++++++++++++++++++++++++++++++++------
mm/page_reporting.c | 22 ++++++---
mm/vmstat.c | 15 ++++++
9 files changed, 184 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16f..ac3178b5fc50b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ extern int page_group_by_mobility_disabled;
MIGRATETYPE_MASK)
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
+ struct list_head pend_list[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -995,6 +996,8 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+ /* Count pages that need tlb shootdown on allocation */
+ atomic_long_t nr_luf_pages;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 30894d8f0a781..863b0c54185dc 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1288,6 +1288,20 @@ static void mark_free_pages(struct zone *zone)
swsusp_set_page_free(pfn_to_page(pfn + i));
}
}
+
+ list_for_each_entry(page,
+ &zone->free_area[order].pend_list[t], buddy_list) {
+ unsigned long i;
+
+ pfn = page_to_pfn(page);
+ for (i = 0; i < (1UL << order); i++) {
+ if (!--page_count) {
+ touch_nmi_watchdog();
+ page_count = WD_PAGE_COUNT;
+ }
+ swsusp_set_page_free(pfn_to_page(pfn + i));
+ }
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index 1fec61603ef32..638deb57f9ddd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -188,11 +188,13 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(zone, vm_stat);
VMCOREINFO_OFFSET(zone, spanned_pages);
VMCOREINFO_OFFSET(free_area, free_list);
+ VMCOREINFO_OFFSET(free_area, pend_list);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_LENGTH(zone.free_area, NR_PAGE_ORDERS);
log_buf_vmcoreinfo_setup();
VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.pend_list, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
diff --git a/mm/compaction.c b/mm/compaction.c
index 90f5c34f333db..27f3d743762bb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1590,24 +1590,28 @@ static void fast_isolate_freepages(struct compact_control *cc)
order = next_search_order(cc, order)) {
struct free_area *area = &cc->zone->free_area[order];
struct list_head *freelist;
+ struct list_head *high_pfn_list;
struct page *freepage;
unsigned long flags;
unsigned int order_scanned = 0;
unsigned long high_pfn = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
if (!area->nr_free)
continue;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
unsigned long pfn;
order_scanned++;
nr_scanned++;
- if (!luf_takeoff_check(freepage))
+ if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1620,26 +1624,34 @@ static void fast_isolate_freepages(struct compact_control *cc)
cc->fast_search_fail = 0;
cc->search_order = order;
page = freepage;
- break;
+ goto done;
}
if (pfn >= min_pfn && pfn > high_pfn) {
high_pfn = pfn;
+ high_pfn_list = freelist;
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
scan_next:
if (order_scanned >= limit)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
/* Use a maximum candidate pfn if a preferred one was not found */
if (!page && high_pfn) {
page = pfn_to_page(high_pfn);
/* Update freepage for the list reorder below */
freepage = page;
+ freelist = high_pfn_list;
}
/* Reorder to so a future search skips recent pages */
@@ -2036,18 +2048,20 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
struct list_head *freelist;
unsigned long flags;
struct page *freepage;
+ bool consider_pend = false;
if (!area->nr_free)
continue;
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry(freepage, freelist, buddy_list) {
unsigned long free_pfn;
if (nr_scanned++ >= limit) {
move_freelist_tail(freelist, freepage);
- break;
+ goto done;
}
free_pfn = page_to_pfn(freepage);
@@ -2070,9 +2084,16 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
- break;
+ goto done;
}
}
+
+ if (!consider_pend) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
spin_unlock_irqrestore(&cc->zone->lock, flags);
}
diff --git a/mm/internal.h b/mm/internal.h
index 3a6da77d04ed3..0dc374553f9b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -836,11 +836,16 @@ void init_cma_reserved_pageblock(struct page *page);
int find_suitable_fallback(struct free_area *area, unsigned int order,
int migratetype, bool only_stealable, bool *can_steal);
-static inline bool free_area_empty(struct free_area *area, int migratetype)
+static inline bool free_list_empty(struct free_area *area, int migratetype)
{
return list_empty(&area->free_list[migratetype]);
}
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+ return list_empty(&area->free_list[migratetype]) &&
+ list_empty(&area->pend_list[migratetype]);
+}
/* mm/util.c */
struct anon_vma *folio_anon_vma(const struct folio *folio);
@@ -1590,12 +1595,22 @@ void luf_takeoff_end(void);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct page *page);
bool luf_takeoff_check_and_fold(struct page *page);
+
+static inline bool non_luf_pages_ok(struct zone *zone)
+{
+ unsigned long nr_free = zone_page_state(zone, NR_FREE_PAGES);
+ unsigned long min_wm = min_wmark_pages(zone);
+ unsigned long nr_luf_pages = atomic_long_read(&zone->nr_luf_pages);
+
+ return nr_free - nr_luf_pages > min_wm;
+}
#else
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
#endif
/* pagewalk.c */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1c205b0a86ed5..12b96cd6a87b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1396,12 +1396,14 @@ static void __meminit zone_init_free_lists(struct zone *zone)
unsigned int order, t;
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
zone->free_area[order].nr_free = 0;
}
#ifdef CONFIG_UNACCEPTED_MEMORY
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
+ atomic_long_set(&zone->nr_luf_pages, 0);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2ea69596ff15..65acc437d8387 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -804,15 +804,28 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
struct free_area *area = &zone->free_area[order];
+ struct list_head *list;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
+ /*
+ * When identifying whether a page requires tlb shootdown, false
+ * positive is okay because it will cause just additional tlb
+ * shootdown.
+ */
+ if (page_luf_key(page)) {
+ list = &area->pend_list[migratetype];
+ atomic_long_add(1 << order, &zone->nr_luf_pages);
+ } else
+ list = &area->free_list[migratetype];
+
if (tail)
- list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ list_add_tail(&page->buddy_list, list);
else
- list_add(&page->buddy_list, &area->free_list[migratetype]);
+ list_add(&page->buddy_list, list);
+
area->nr_free++;
}
@@ -831,7 +844,20 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), old_mt, 1 << order);
- list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+ /*
+ * The page might have been taken from a pfn where it's not
+ * clear which list was used. Therefore, conservatively
+ * consider it as pend_list, not to miss any true ones that
+ * require tlb shootdown.
+ *
+ * When identifying whether a page requires tlb shootdown, false
+ * positive is okay because it will cause just additional tlb
+ * shootdown.
+ */
+ if (page_luf_key(page))
+ list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
+ else
+ list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
account_freepages(zone, 1 << order, new_mt);
@@ -848,6 +874,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
+ if (page_luf_key(page))
+ atomic_long_sub(1 << order, &zone->nr_luf_pages);
+
list_del(&page->buddy_list);
__ClearPageBuddy(page);
zone->free_area[order].nr_free--;
@@ -866,15 +895,48 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
account_freepages(zone, -(1 << order), migratetype);
}
-static inline struct page *get_page_from_free_area(struct free_area *area,
- int migratetype)
+static inline struct page *get_page_from_free_area(struct zone *zone,
+ struct free_area *area, int migratetype)
{
- struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
- struct page, buddy_list);
+ struct page *page;
+ bool pend_first;
- if (page && luf_takeoff_check(page))
- return page;
+ /*
+ * XXX: Make the decision preciser if needed e.g. using
+ * zone_watermark_ok() or its family, but for now, don't want to
+ * make it heavier.
+ *
+ * Try free_list, holding non-luf pages, first if there are
+ * enough non-luf pages to aggressively defer tlb flush, but
+ * should try pend_list first instead if not.
+ */
+ pend_first = !non_luf_pages_ok(zone);
+
+ if (pend_first) {
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ } else {
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+ }
return NULL;
}
@@ -1027,6 +1089,8 @@ static inline void __free_one_page(struct page *page,
if (fpi_flags & FPI_TO_TAIL)
to_tail = true;
+ else if (page_luf_key(page))
+ to_tail = true;
else if (is_shuffle_order(order))
to_tail = shuffle_pick_tail();
else
@@ -1556,6 +1620,8 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
unsigned int nr_added = 0;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -1569,7 +1635,10 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- __add_to_free_list(&page[size], zone, high, migratetype, false);
+ if (page_luf_key(&page[size]))
+ tail = true;
+
+ __add_to_free_list(&page[size], zone, high, migratetype, tail);
set_buddy_order(&page[size], high);
nr_added += size;
}
@@ -1754,7 +1823,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
area = &(zone->free_area[current_order]);
- page = get_page_from_free_area(area, migratetype);
+ page = get_page_from_free_area(zone, area, migratetype);
if (!page)
continue;
@@ -2188,7 +2257,8 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
- if (luf_takeoff_no_shootdown())
+ if (free_list_empty(area, fallback_mt) &&
+ luf_takeoff_no_shootdown())
continue;
if (can_steal_fallback(order, migratetype))
@@ -2292,7 +2362,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
struct free_area *area = &(zone->free_area[order]);
int mt;
- page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+ page = get_page_from_free_area(zone, area, MIGRATE_HIGHATOMIC);
if (!page)
continue;
@@ -2430,7 +2500,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
VM_BUG_ON(current_order > MAX_PAGE_ORDER);
do_steal:
- page = get_page_from_free_area(area, fallback_mt);
+ page = get_page_from_free_area(zone, area, fallback_mt);
/* take off list, maybe claim block, expand remainder */
page = steal_suitable_fallback(zone, page, current_order, order,
@@ -7180,6 +7250,8 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
struct page *current_buddy;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
@@ -7193,7 +7265,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- add_to_free_list(current_buddy, zone, high, migratetype, false);
+ if (page_luf_key(current_buddy))
+ tail = true;
+
+ add_to_free_list(current_buddy, zone, high, migratetype, tail);
set_buddy_order(current_buddy, high);
}
}
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 03a7f5f6dc073..e152b22fbba8a 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -159,15 +159,17 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
struct page *page, *next;
long budget;
int err = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
/*
* Perform early check, if free area is empty there is
* nothing to process so we can skip this free_list.
*/
- if (list_empty(list))
+ if (free_area_empty(area, mt))
return err;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -185,14 +187,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
* should always be a power of 2.
*/
budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
+retry:
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
/* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- if (!luf_takeoff_check(page)) {
+ if (unlikely(consider_pend && !luf_takeoff_check(page))) {
VM_WARN_ON(1);
continue;
}
@@ -205,14 +207,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
- break;
+ goto done;
}
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order, false)) {
next = page;
- break;
+ goto done;
}
/* Add page to scatter list */
@@ -263,9 +265,15 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* exit on error */
if (err)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ list = &area->pend_list[mt];
+ goto retry;
+ }
+done:
/* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c9..3fb9a5f6dd6da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1581,6 +1581,21 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
break;
}
}
+ list_for_each(curr, &area->pend_list[mtype]) {
+ /*
+ * Cap the pend_list iteration because it might
+ * be really large and we are under a spinlock
+ * so a long time spent here could trigger a
+ * hard lockup detector. Anyway this is a
+ * debugging tool so knowing there is a handful
+ * of pages of this order should be more than
+ * sufficient.
+ */
+ if (++freecount >= 100000) {
+ overflow = true;
+ break;
+ }
+ }
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
spin_unlock_irq(&zone->lock);
cond_resched();
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 14/26] mm/rmap: recognize read-only tlb entries during batched tlb flush
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (12 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 13/26] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 15/26] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
` (13 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which requires recognizing read-only tlb entries and handling them
differently. The separate read-only batch introduced in this patch,
tlb_ubc_ro, will be used by the luf mechanism.
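The heart of the change is the routing below, condensed from the
set_tlb_ubc_flush_pending() hunk; queue_pending_flush() is an
illustrative name, and tlb_ubc_ro is the new per-task batch added here.

static void queue_pending_flush(struct mm_struct *mm, pte_t pteval,
				unsigned long uaddr)
{
	struct tlbflush_unmap_batch *tlb_ubc;

	if (!pte_accessible(mm, pteval))
		return;

	/* Read-only mappings are batched separately in tlb_ubc_ro. */
	if (pte_write(pteval))
		tlb_ubc = &current->tlb_ubc;
	else
		tlb_ubc = &current->tlb_ubc_ro;

	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
	tlb_ubc->flush_required = true;
}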
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/rmap.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c4ff83e1d5953..a217d6011fdfe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1381,6 +1381,7 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
+ struct tlbflush_unmap_batch tlb_ubc_ro;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1581b1a00f974..3ed6234dd777e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -775,6 +775,7 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -789,6 +790,9 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc);
+ if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_ro);
+
reset_batch(tlb_ubc_takeoff);
}
@@ -801,7 +805,9 @@ void try_to_unmap_flush_takeoff(void)
void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
if (!tlb_ubc->flush_required)
return;
@@ -813,8 +819,9 @@ void try_to_unmap_flush(void)
void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
- if (tlb_ubc->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable)
try_to_unmap_flush();
}
@@ -831,13 +838,18 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr)
{
- struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc;
int batch;
bool writable = pte_dirty(pteval);
if (!pte_accessible(mm, pteval))
return;
+ if (pte_write(pteval))
+ tlb_ubc = ¤t->tlb_ubc;
+ else
+ tlb_ubc = ¤t->tlb_ubc_ro;
+
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 15/26] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (13 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 14/26] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 16/26] mm: implement LUF(Lazy Unmap Flush) deferring tlb flush when folios get unmapped Byungchul Park
` (12 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which needs a single place to hook page cache updates: the pages being
updated might have been mapped by some tasks, so any tlb flush still
pending for them can be performed first.
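For illustration only, this is why the wrappers help: once every caller
goes through them, a later patch can perform the pending tlb flush for
the mapping in exactly one place. luf_flush_for_mapping() below is a
made-up placeholder, not an API added by this patch; the wrapper itself
matches the one added to include/linux/fs.h.

static inline int mapping_write_begin(struct file *file,
				      struct address_space *mapping,
				      loff_t pos, unsigned len,
				      struct folio **foliop, void **fsdata)
{
	/* Hypothetical hook a later patch could add in one spot. */
	luf_flush_for_mapping(mapping);

	return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
					   fsdata);
}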
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 ++++-------
fs/affs/file.c | 4 ++--
fs/buffer.c | 14 ++++++--------
fs/exfat/file.c | 5 ++---
fs/ext4/verity.c | 5 ++---
fs/f2fs/super.c | 5 ++---
fs/f2fs/verity.c | 5 ++---
fs/namei.c | 5 ++---
include/linux/fs.h | 18 ++++++++++++++++++
mm/filemap.c | 5 ++---
10 files changed, 42 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index fe69f2c8527d7..1d475d681d3de 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -422,7 +422,6 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
const struct drm_i915_gem_pwrite *arg)
{
struct address_space *mapping = obj->base.filp->f_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
char __user *user_data = u64_to_user_ptr(arg->data_ptr);
u64 remain;
loff_t pos;
@@ -481,7 +480,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
if (err)
return err;
- err = aops->write_begin(obj->base.filp, mapping, pos, len,
+ err = mapping_write_begin(obj->base.filp, mapping, pos, len,
&folio, &data);
if (err < 0)
return err;
@@ -492,7 +491,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
pagefault_enable();
kunmap_local(vaddr);
- err = aops->write_end(obj->base.filp, mapping, pos, len,
+ err = mapping_write_end(obj->base.filp, mapping, pos, len,
len - unwritten, folio, data);
if (err < 0)
return err;
@@ -658,7 +657,6 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
{
struct drm_i915_gem_object *obj;
struct file *file;
- const struct address_space_operations *aops;
loff_t pos;
int err;
@@ -670,21 +668,20 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
GEM_BUG_ON(obj->write_domain != I915_GEM_DOMAIN_CPU);
file = obj->base.filp;
- aops = file->f_mapping->a_ops;
pos = 0;
do {
unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
struct folio *folio;
void *fsdata;
- err = aops->write_begin(file, file->f_mapping, pos, len,
+ err = mapping_write_begin(file, file->f_mapping, pos, len,
&folio, &fsdata);
if (err < 0)
goto fail;
memcpy_to_folio(folio, offset_in_folio(folio, pos), data, len);
- err = aops->write_end(file, file->f_mapping, pos, len, len,
+ err = mapping_write_end(file, file->f_mapping, pos, len, len,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/fs/affs/file.c b/fs/affs/file.c
index a5a861dd52230..10e7f53828e93 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -885,9 +885,9 @@ affs_truncate(struct inode *inode)
loff_t isize = inode->i_size;
int res;
- res = mapping->a_ops->write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
if (!res)
- res = mapping->a_ops->write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
else
inode->i_size = AFFS_I(inode)->mmu_private;
mark_inode_dirty(inode);
diff --git a/fs/buffer.c b/fs/buffer.c
index 88e765b0699fe..7cb0295500937 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2456,7 +2456,6 @@ EXPORT_SYMBOL(block_read_full_folio);
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
struct folio *folio;
void *fsdata = NULL;
int err;
@@ -2465,11 +2464,11 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size)
if (err)
goto out;
- err = aops->write_begin(NULL, mapping, size, 0, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, size, 0, &folio, &fsdata);
if (err)
goto out;
- err = aops->write_end(NULL, mapping, size, 0, 0, folio, fsdata);
+ err = mapping_write_end(NULL, mapping, size, 0, 0, folio, fsdata);
BUG_ON(err > 0);
out:
@@ -2481,7 +2480,6 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
loff_t pos, loff_t *bytes)
{
struct inode *inode = mapping->host;
- const struct address_space_operations *aops = mapping->a_ops;
unsigned int blocksize = i_blocksize(inode);
struct folio *folio;
void *fsdata = NULL;
@@ -2501,12 +2499,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = PAGE_SIZE - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
@@ -2534,12 +2532,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = offset - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
diff --git a/fs/exfat/file.c b/fs/exfat/file.c
index a25d7eb789f4c..242563b9dec95 100644
--- a/fs/exfat/file.c
+++ b/fs/exfat/file.c
@@ -539,7 +539,6 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
struct inode *inode = file_inode(file);
struct exfat_inode_info *ei = EXFAT_I(inode);
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *ops = mapping->a_ops;
pos = ei->valid_size;
while (pos < new_valid_size) {
@@ -550,11 +549,11 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
if (pos + len > new_valid_size)
len = new_valid_size - pos;
- err = ops->write_begin(file, mapping, pos, len, &folio, NULL);
+ err = mapping_write_begin(file, mapping, pos, len, &folio, NULL);
if (err)
goto out;
- err = ops->write_end(file, mapping, pos, len, len, folio, NULL);
+ err = mapping_write_end(file, mapping, pos, len, len, folio, NULL);
if (err < 0)
goto out;
pos += len;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index d9203228ce979..64fa43f80c73e 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -68,7 +68,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > inode->i_sb->s_maxbytes)
return -EFBIG;
@@ -80,13 +79,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 87ab5696bd482..f8d5ee466807c 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2678,7 +2678,6 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
{
struct inode *inode = sb_dqopt(sb)->files[type];
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
int offset = off & (sb->s_blocksize - 1);
size_t towrite = len;
struct folio *folio;
@@ -2690,7 +2689,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
tocopy = min_t(unsigned long, sb->s_blocksize - offset,
towrite);
retry:
- err = a_ops->write_begin(NULL, mapping, off, tocopy,
+ err = mapping_write_begin(NULL, mapping, off, tocopy,
&folio, &fsdata);
if (unlikely(err)) {
if (err == -ENOMEM) {
@@ -2703,7 +2702,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
memcpy_to_folio(folio, offset_in_folio(folio, off), data, tocopy);
- a_ops->write_end(NULL, mapping, off, tocopy, tocopy,
+ mapping_write_end(NULL, mapping, off, tocopy, tocopy,
folio, fsdata);
offset = 0;
towrite -= tocopy;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 2287f238ae09e..b232589546d39 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -72,7 +72,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > F2FS_BLK_TO_BYTES(max_file_blocks(inode)))
return -EFBIG;
@@ -84,13 +83,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/namei.c b/fs/namei.c
index 4a4a22a08ac20..14a701ecf1a7e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5349,7 +5349,6 @@ EXPORT_SYMBOL(page_readlink);
int page_symlink(struct inode *inode, const char *symname, int len)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
bool nofs = !mapping_gfp_constraint(mapping, __GFP_FS);
struct folio *folio;
void *fsdata = NULL;
@@ -5359,7 +5358,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
retry:
if (nofs)
flags = memalloc_nofs_save();
- err = aops->write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
if (nofs)
memalloc_nofs_restore(flags);
if (err)
@@ -5367,7 +5366,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
memcpy(folio_address(folio), symname, len - 1);
- err = aops->write_end(NULL, mapping, 0, len - 1, len - 1,
+ err = mapping_write_end(NULL, mapping, 0, len - 1, len - 1,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c15..bfd8aaeb78bb8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -494,6 +494,24 @@ struct address_space {
#define PAGECACHE_TAG_WRITEBACK XA_MARK_1
#define PAGECACHE_TAG_TOWRITE XA_MARK_2
+static inline int mapping_write_begin(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len,
+ struct folio **foliop, void **fsdata)
+{
+ return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ fsdata);
+}
+
+static inline int mapping_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct folio *folio, void *fsdata)
+{
+ return mapping->a_ops->write_end(file, mapping, pos, len, copied,
+ folio, fsdata);
+}
+
/*
* Returns true if any of the pages in the mapping are marked with the tag.
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index e582a1545d2ae..a4930449fc705 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4016,7 +4016,6 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
struct file *file = iocb->ki_filp;
loff_t pos = iocb->ki_pos;
struct address_space *mapping = file->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
size_t chunk = mapping_max_folio_size(mapping);
long status = 0;
ssize_t written = 0;
@@ -4050,7 +4049,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
break;
}
- status = a_ops->write_begin(file, mapping, pos, bytes,
+ status = mapping_write_begin(file, mapping, pos, bytes,
&folio, &fsdata);
if (unlikely(status < 0))
break;
@@ -4065,7 +4064,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
flush_dcache_folio(folio);
- status = a_ops->write_end(file, mapping, pos, bytes, copied,
+ status = mapping_write_end(file, mapping, pos, bytes, copied,
folio, fsdata);
if (unlikely(status != copied)) {
iov_iter_revert(i, copied - max(status, 0L));
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 16/26] mm: implement LUF(Lazy Unmap Flush) deferring tlb flush when folios get unmapped
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (14 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 15/26] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 17/26] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
` (11 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed, eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, as
long as the contents of the folios don't change while staying in pcp or
buddy so we can still read the data through the stale tlb entries.
tlb flush can be deferred when folios get unmapped, as long as the
needed tlb flush is guaranteed to be performed before the folios are
actually used again, and of course, only if none of the corresponding
ptes have write permission. Otherwise, the system will get messed up.
To achieve that, for folios that map only to non-writable tlb entries,
skip the tlb flush during unmapping and perform it just before the
folios are actually used again, on their way out of buddy or pcp.
However, the flush pending by LUF must be cancelled, that is, the
deferred TLB flush must be performed right away, in the following cases
(case 1 is sketched right after this list):
1. a writable pte is newly set through the fault handler
2. a file is updated
3. kasan needs poisoning on free
4. the kernel wants to init pages on free
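A minimal sketch of case 1, assuming a helper named luf_flush_mm()
(illustrative name only) that performs the tlb flush still pending for
an mm; the actual hooks added by this patch live in mm/memory.c and
mm/pgtable-generic.c:

static void make_pte_writable(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, pte_t pte)
{
	/*
	 * A writable mapping must never be built on top of a stale,
	 * unflushed tlb entry: stop deferring and flush right away.
	 */
	if (pte_write(pte))
		luf_flush_mm(vma->vm_mm);

	set_pte_at(vma->vm_mm, addr, ptep, pte);
}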
No matter what type of workload is used for performance evaluation, the
result should be positive thanks to the unconditional reduction of tlb
flushes, tlb misses and interrupts. For the test, I picked one of the
most popular and heavy workloads, llama.cpp, an LLM (Large Language
Model) inference engine.
The result depends on memory latency and how often reclaim runs, which
determine the tlb miss overhead and how many times unmapping happens.
On my system, the result shows:
1. tlb shootdown interrupts are reduced by about 97%.
2. The test program runtime is reduced by about 4.5%.
The test environment and the test set are like:
Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
CPU: 1 socket 64 core with hyper thread on
Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
Config: swap off, numa balancing tiering on, demotion enabled
llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
wait
where,
-t: nr of threads, -s: seed used to make the runtime stable,
-n: nr of tokens that determines the runtime, -p: prompt to ask,
-m: LLM model to use.
Run the test set 5 times successively with caches dropped before every
run via 'echo 3 > /proc/sys/vm/drop_caches'. Each inference prints its
runtime when it finishes. The results are:
1. Runtime from the output of llama.cpp
BEFORE
------
llama_print_timings: total time = 883450.54 ms / 24 tokens
llama_print_timings: total time = 861665.91 ms / 24 tokens
llama_print_timings: total time = 898079.02 ms / 24 tokens
llama_print_timings: total time = 879897.69 ms / 24 tokens
llama_print_timings: total time = 892360.75 ms / 24 tokens
llama_print_timings: total time = 884587.85 ms / 24 tokens
llama_print_timings: total time = 861023.19 ms / 24 tokens
llama_print_timings: total time = 900022.18 ms / 24 tokens
llama_print_timings: total time = 878771.88 ms / 24 tokens
llama_print_timings: total time = 889027.98 ms / 24 tokens
llama_print_timings: total time = 880783.90 ms / 24 tokens
llama_print_timings: total time = 856475.29 ms / 24 tokens
llama_print_timings: total time = 896842.21 ms / 24 tokens
llama_print_timings: total time = 878883.53 ms / 24 tokens
llama_print_timings: total time = 890122.10 ms / 24 tokens
AFTER
-----
llama_print_timings: total time = 871060.86 ms / 24 tokens
llama_print_timings: total time = 825609.53 ms / 24 tokens
llama_print_timings: total time = 836854.81 ms / 24 tokens
llama_print_timings: total time = 843147.99 ms / 24 tokens
llama_print_timings: total time = 831426.65 ms / 24 tokens
llama_print_timings: total time = 873939.23 ms / 24 tokens
llama_print_timings: total time = 826127.69 ms / 24 tokens
llama_print_timings: total time = 835489.26 ms / 24 tokens
llama_print_timings: total time = 842589.62 ms / 24 tokens
llama_print_timings: total time = 833700.66 ms / 24 tokens
llama_print_timings: total time = 875996.19 ms / 24 tokens
llama_print_timings: total time = 826401.73 ms / 24 tokens
llama_print_timings: total time = 839341.28 ms / 24 tokens
llama_print_timings: total time = 841075.10 ms / 24 tokens
llama_print_timings: total time = 835136.41 ms / 24 tokens
2. tlb shootdowns from 'cat /proc/interrupts'
BEFORE
------
TLB:
80911532 93691786 100296251 111062810 109769109 109862429
108968588 119175230 115779676 118377498 119325266 120300143
124514185 116697222 121068466 118031913 122660681 117494403
121819907 116960596 120936335 117217061 118630217 122322724
119595577 111693298 119232201 120030377 115334687 113179982
118808254 116353592 140987367 137095516 131724276 139742240
136501150 130428761 127585535 132483981 133430250 133756207
131786710 126365824 129812539 133850040 131742690 125142213
128572830 132234350 131945922 128417707 133355434 129972846
126331823 134050849 133991626 121129038 124637283 132830916
126875507 122322440 125776487 124340278 TLB shootdowns
AFTER
-----
TLB:
2121206 2615108 2983494 2911950 3055086 3092672
3204894 3346082 3286744 3307310 3357296 3315940
3428034 3112596 3143325 3185551 3186493 3322314
3330523 3339663 3156064 3272070 3296309 3198962
3332662 3315870 3234467 3353240 3281234 3300666
3345452 3173097 4009196 3932215 3898735 3726531
3717982 3671726 3728788 3724613 3799147 3691764
3620630 3684655 3666688 3393974 3448651 3487593
3446357 3618418 3671920 3712949 3575264 3715385
3641513 3630897 3691047 3630690 3504933 3662647
3629926 3443044 3832970 3548813 TLB shootdowns
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 5 ++
include/linux/fs.h | 12 +++-
include/linux/mm_types.h | 6 ++
include/linux/sched.h | 9 +++
kernel/sched/core.c | 1 +
mm/internal.h | 94 ++++++++++++++++++++++++-
mm/memory.c | 15 ++++
mm/pgtable-generic.c | 2 +
mm/rmap.c | 141 +++++++++++++++++++++++++++++++++++---
mm/truncate.c | 55 +++++++++++++--
mm/vmscan.c | 12 +++-
11 files changed, 333 insertions(+), 19 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b756..4a99351be111e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -549,6 +549,11 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
+ /*
+ * Don't leave stale tlb entries for this vma.
+ */
+ luf_flush(0);
+
if (tlb->fullmm)
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bfd8aaeb78bb8..ec88270221bfe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -499,8 +499,18 @@ static inline int mapping_write_begin(struct file *file,
loff_t pos, unsigned len,
struct folio **foliop, void **fsdata)
{
- return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ int ret;
+
+ ret = mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
fsdata);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ if (!ret)
+ luf_flush(0);
+
+ return ret;
}
static inline int mapping_write_end(struct file *file,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 39a6b5124b01f..b3eb5a4e45efb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1270,6 +1270,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+void luf_flush(unsigned short luf_key);
+#else
+static inline void luf_flush(unsigned short luf_key) {}
+#endif
+
struct vm_fault;
/**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a217d6011fdfe..94321d51b91e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1382,6 +1382,15 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
struct tlbflush_unmap_batch tlb_ubc_ro;
+ struct tlbflush_unmap_batch tlb_ubc_luf;
+
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ /*
+ * whether all the mappings of a folio during unmap are read-only
+ * so that luf can work on the folio
+ */
+ bool can_luf;
+#endif
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 719e0ed1e9761..aea08d8a9e258 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5225,6 +5225,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
+ luf_flush(0);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/internal.h b/mm/internal.h
index 0dc374553f9b5..fe4a1c174895f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1604,13 +1604,105 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-#else
+
+unsigned short fold_unmap_luf(void);
+
+/*
+ * Reset the indicator that tells whether there are no writable mappings,
+ * at the beginning of every rmap traverse for unmap. luf can work only
+ * when all the mappings are read-only.
+ */
+static inline void can_luf_init(struct folio *f)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (want_init_on_free())
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (!should_skip_kasan_poison(folio_page(f, 0)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles zone device folio.
+ */
+ else if (unlikely(folio_is_zone_device(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles hugetlb folio.
+ */
+ else if (unlikely(folio_test_hugetlb(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles large folio.
+ */
+ else if (unlikely(folio_test_large(f)))
+ current->can_luf = false;
+ /*
+ * Can track write of anon folios through fault handler.
+ */
+ else if (folio_test_anon(f))
+ current->can_luf = true;
+ /*
+ * Can track write of file folios through page cache or truncation.
+ */
+ else if (folio_mapping(f))
+ current->can_luf = true;
+ /*
+ * For folios that are neither anon nor file-backed, do not apply luf.
+ */
+ else
+ current->can_luf = false;
+}
+
+/*
+ * Mark the folio as not applicable to luf once a writable or dirty
+ * pte has been found during the rmap traverse for unmap.
+ */
+static inline void can_luf_fail(void)
+{
+ current->can_luf = false;
+}
+
+/*
+ * Check if all the mappings are read-only.
+ */
+static inline bool can_luf_test(void)
+{
+ return current->can_luf;
+}
+
+static inline bool can_luf_vma(struct vm_area_struct *vma)
+{
+ /*
+ * A shared region requires a medium like a file to keep track of
+ * all the associated mm_structs. luf makes use of struct
+ * address_space for that purpose.
+ */
+ if (vma->vm_flags & VM_SHARED)
+ return !!vma->vm_file;
+
+ /*
+ * Private region can be handled through its mm_struct.
+ */
+ return true;
+}
+#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
+static inline unsigned short fold_unmap_luf(void) { return 0; }
+
+static inline void can_luf_init(struct folio *f) {}
+static inline void can_luf_fail(void) {}
+static inline bool can_luf_test(void) { return false; }
+static inline bool can_luf_vma(struct vm_area_struct *vma) { return false; }
#endif
/* pagewalk.c */
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f7..0e85c49bc5028 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6081,6 +6081,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6106,6 +6107,14 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
lru_gen_enter_fault(vma);
+ /*
+ * Any potential case that makes a pte writable, even forcibly,
+ * should be considered.
+ */
+ if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
+ flags & FAULT_FLAG_WRITE)
+ flush = true;
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6137,6 +6146,12 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
out:
mm_account_fault(mm, regs, address, flags, ret);
+ /*
+ * Ensure to clean stale tlb entries for this vma.
+ */
+ if (flush)
+ luf_flush(0);
+
return ret;
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5297dcc38c37a..215d8d93560fd 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -99,6 +99,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
pte = ptep_get_and_clear(mm, address, ptep);
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
+ else
+ luf_flush(0);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 3ed6234dd777e..0aaf02b1b34c3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -646,7 +646,7 @@ static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
*/
-static unsigned long __maybe_unused new_luf_ugen(void)
+static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
@@ -723,7 +723,7 @@ static atomic_t luf_kgen = ATOMIC_INIT(1);
/*
* Don't return invalid luf_key, zero.
*/
-static unsigned short __maybe_unused new_luf_key(void)
+static unsigned short new_luf_key(void)
{
unsigned short luf_key = atomic_inc_return(&luf_kgen);
@@ -776,6 +776,7 @@ void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -793,9 +794,72 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc_ro);
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_luf);
+
reset_batch(tlb_ubc_takeoff);
}
+/*
+ * Should be called just before try_to_unmap_flush() to optimize the tlb
+ * shootdown using arch_tlbbatch_done().
+ */
+unsigned short fold_unmap_luf(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
+ struct luf_batch *lb;
+ unsigned long new_ugen;
+ unsigned short new_key;
+ unsigned long flags;
+
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ /*
+ * fold_unmap_luf() is always followed by try_to_unmap_flush().
+ */
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc->arch)) {
+ tlb_ubc_luf->flush_required = false;
+ tlb_ubc_luf->writable = false;
+ }
+
+ /*
+ * Check again after shrinking.
+ */
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ new_ugen = new_luf_ugen();
+ new_key = new_luf_key();
+
+ /*
+ * Update the next entry of the luf_batch table, that is, the
+ * oldest entry among the candidates, for which tlb flushes have
+ * hopefully already been done on all of the CPUs.
+ */
+ lb = &luf_batch[new_key];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc_luf, new_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+
+ reset_batch(tlb_ubc_luf);
+ return new_key;
+}
+
+void luf_flush(unsigned short luf_key)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct luf_batch *lb = &luf_batch[luf_key];
+ unsigned long flags;
+
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -806,8 +870,10 @@ void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
@@ -820,8 +886,9 @@ void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
- if (tlb_ubc->writable || tlb_ubc_ro->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable || tlb_ubc_luf->writable)
try_to_unmap_flush();
}
@@ -836,7 +903,8 @@ void try_to_unmap_flush_dirty(void)
(TLB_FLUSH_BATCH_PENDING_MASK / 2)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long uaddr)
+ unsigned long uaddr,
+ struct vm_area_struct *vma)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -845,7 +913,16 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!pte_accessible(mm, pteval))
return;
- if (pte_write(pteval))
+ if (can_luf_test()) {
+ /*
+ * luf cannot work with the folio once a writable or
+ * dirty mapping has been found on it.
+ */
+ if (pte_write(pteval) || !can_luf_vma(vma))
+ can_luf_fail();
+ }
+
+ if (!can_luf_test())
tlb_ubc = ¤t->tlb_ubc;
else
tlb_ubc = ¤t->tlb_ubc_ro;
@@ -853,6 +930,21 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
+ if (can_luf_test()) {
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * Accumulate to the 0th entry right away so that
+ * luf_flush(0) can be used to properly perform the pending
+ * TLB flush once this unmapping is observed.
+ */
+ lb = &luf_batch[0];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
+ write_unlock_irqrestore(&lb->lock, flags);
+ }
+
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -907,6 +999,8 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
* This must be called under the PTL so that an access to tlb_flush_batched
* that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
* via the PTL.
+ *
+ * LUF(Lazy Unmap Flush) also relies on this for mprotect/munmap/etc.
*/
void flush_tlb_batched_pending(struct mm_struct *mm)
{
@@ -916,6 +1010,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
if (pending != flushed) {
arch_flush_tlb_batched_pending(mm);
+
/*
* If the new TLB flushing is pending during flushing, leave
* mm->tlb_flush_batched as is, to avoid losing flushing.
@@ -926,7 +1021,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
}
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long uaddr)
+ unsigned long uaddr,
+ struct vm_area_struct *vma)
{
}
@@ -1292,6 +1388,11 @@ int folio_mkclean(struct folio *folio)
rmap_walk(folio, &rwc);
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
return cleaned;
}
EXPORT_SYMBOL_GPL(folio_mkclean);
@@ -1961,7 +2062,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2132,6 +2233,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2164,11 +2267,21 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
+
+ can_luf_init(folio);
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
/*
@@ -2338,7 +2451,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2494,6 +2607,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2513,6 +2628,9 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
/*
* Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
@@ -2537,10 +2655,17 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
if (!folio_test_ksm(folio) && folio_test_anon(folio))
rwc.invalid_vma = invalid_migration_vma;
+ can_luf_init(folio);
+
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/truncate.c b/mm/truncate.c
index e5151703ba04a..14618c53f1910 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -124,6 +124,11 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
if (aops->invalidate_folio)
aops->invalidate_folio(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -161,6 +166,11 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
truncate_cleanup_folio(folio);
filemap_remove_folio(folio);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return 0;
}
@@ -206,6 +216,12 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
if (folio_needs_release(folio))
folio_invalidate(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
if (!folio_test_large(folio))
return true;
if (split_folio(folio) == 0)
@@ -247,19 +263,28 @@ EXPORT_SYMBOL(generic_error_remove_folio);
*/
long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
/* The page may have been truncated before it was locked */
if (!mapping)
- return 0;
+ goto out;
if (folio_test_dirty(folio) || folio_test_writeback(folio))
- return 0;
+ goto out;
/* The refcount will be elevated if any page in the folio is mapped */
if (folio_ref_count(folio) >
folio_nr_pages(folio) + folio_has_private(folio) + 1)
- return 0;
+ goto out;
if (!filemap_release_folio(folio, 0))
- return 0;
+ goto out;
- return remove_mapping(mapping, folio);
+ ret = remove_mapping(mapping, folio);
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
@@ -299,7 +324,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
bool same_folio;
if (mapping_empty(mapping))
- return;
+ goto out;
/*
* 'start' and 'end' always covers the range of pages to be fully
@@ -387,6 +412,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
folio_batch_release(&fbatch);
}
+
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -502,6 +533,11 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
folio_batch_release(&fbatch);
cond_resched();
}
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return count;
}
@@ -594,7 +630,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
int did_range_unmap = 0;
if (mapping_empty(mapping))
- return 0;
+ goto out;
folio_batch_init(&fbatch);
index = start;
@@ -664,6 +700,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
if (dax_mapping(mapping)) {
unmap_mapping_pages(mapping, start, end - start + 1, false);
}
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2970a8f35d3d3..ffc4a48710f1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -821,6 +821,8 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
*/
long remove_mapping(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
if (__remove_mapping(mapping, folio, false, NULL)) {
/*
* Unfreezing the refcount with 1 effectively
@@ -828,9 +830,15 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
* atomic operation.
*/
folio_ref_unfreeze(folio, 1);
- return folio_nr_pages(folio);
+ ret = folio_nr_pages(folio);
}
- return 0;
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 17/26] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (15 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 16/26] mm: implement LUF(Lazy Unmap Flush) deferring tlb flush when folios get unmapped Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 18/26] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
` (10 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
The luf mechanism performs tlb shootdown for mappings that have been
unmapped in a lazy manner. However, it doesn't have to perform the tlb
shootdown on cpus that have already flushed their tlb since the tlb
shootdown was requested.
Since luf already introduced its own generation number, luf_ugen, used
as a global timestamp, it's possible to selectively pick the cpus that
have already performed the required tlb flush.
This patch introduces APIs that use the generation number to select and
remove those cpus, so that the tlb shootdown can be performed with a
smaller cpumask, for all the CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
archs: x86, riscv, and arm64.
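Below is a minimal userspace sketch of the pruning idea, assuming
ugen_before() is a wraparound-safe "earlier than" comparison. It only
models the logic; the authoritative per-arch implementations are in the
diff below, and the names here are assumptions of this model.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 8

/* Last tlb-flush generation observed per cpu (models the per-cpu ugen_done). */
static unsigned long ugen_done[NR_CPUS];

/* Wraparound-safe "a is earlier than b". */
static bool ugen_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Drop cpus already flushed at or after @ugen; return true if none remain. */
static bool batch_diet(uint32_t *cpumask, unsigned long ugen)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(*cpumask & (1u << cpu)))
			continue;
		if (!ugen_before(ugen_done[cpu], ugen))
			*cpumask &= ~(1u << cpu);	/* no IPI needed */
	}
	return *cpumask == 0;
}

int main(void)
{
	uint32_t mask = 0x0f;		/* shootdown requested for cpu0-cpu3 */

	ugen_done[1] = 10;		/* cpu1 flushed after the request below */
	ugen_done[3] = 7;		/* cpu3 flushed, but before the request */

	batch_diet(&mask, 8);		/* the request was stamped with ugen 8 */
	printf("cpus still to flush: 0x%x\n", mask);	/* 0xd: cpu0, cpu2, cpu3 */
	return 0;
}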
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 26 +++++++
arch/riscv/include/asm/tlbflush.h | 4 ++
arch/riscv/mm/tlbflush.c | 108 ++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 4 ++
arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++
include/linux/sched.h | 1 +
mm/internal.h | 4 ++
mm/page_alloc.c | 32 +++++++--
mm/rmap.c | 46 ++++++++++++-
9 files changed, 327 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a62e1ea61e4af..f8290bec32e01 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -354,6 +354,32 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
+static inline void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
/* nothing to do */
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 1dc7d30273d59..ec5caeb3cf8ef 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -65,6 +65,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
unsigned long uaddr);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 36f996af6256c..93afb7a299003 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -202,3 +202,111 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
}
+
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * batch will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done may end up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done may end up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0ae9564c7301e..1fc5bacd72dff 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -293,6 +293,10 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 860e49b223fd7..975f58fa4b30f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1240,6 +1240,114 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * batch will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done may end up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done may end up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 94321d51b91e8..5c6c4fd021973 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1377,6 +1377,7 @@ struct task_struct {
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
int luf_no_shootdown;
int luf_takeoff_started;
+ unsigned long luf_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/internal.h b/mm/internal.h
index fe4a1c174895f..77657c17af204 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1246,6 +1246,7 @@ void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void reset_batch(struct tlbflush_unmap_batch *batch);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
@@ -1261,6 +1262,9 @@ static inline void try_to_unmap_flush_takeoff(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+}
static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
{
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65acc437d8387..3032fedd8392b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -668,9 +668,11 @@ bool luf_takeoff_start(void)
*/
void luf_takeoff_end(void)
{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
+ unsigned long cur_luf_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -697,10 +699,19 @@ void luf_takeoff_end(void)
if (no_shootdown)
goto out;
+ cur_luf_ugen = current->luf_ugen;
+
+ current->luf_ugen = 0;
+
+ if (cur_luf_ugen && arch_tlbbatch_diet(&tlb_ubc_takeoff->arch, cur_luf_ugen))
+ reset_batch(tlb_ubc_takeoff);
+
try_to_unmap_flush_takeoff();
out:
- if (outmost)
+ if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
+ VM_WARN_ON(current->luf_ugen);
+ }
}
/*
@@ -757,6 +768,7 @@ bool luf_takeoff_check_and_fold(struct page *page)
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
unsigned short luf_key = page_luf_key(page);
struct luf_batch *lb;
+ unsigned long lb_ugen;
unsigned long flags;
/*
@@ -770,13 +782,25 @@ bool luf_takeoff_check_and_fold(struct page *page)
if (!luf_key)
return true;
- if (current->luf_no_shootdown)
- return false;
-
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
+
fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 0aaf02b1b34c3..cf6667fb18fe2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -656,7 +656,7 @@ static unsigned long new_luf_ugen(void)
return ugen;
}
-static void reset_batch(struct tlbflush_unmap_batch *batch)
+void reset_batch(struct tlbflush_unmap_batch *batch)
{
arch_tlbbatch_clear(&batch->arch);
batch->flush_required = false;
@@ -743,8 +743,14 @@ static void __fold_luf_batch(struct luf_batch *dst_lb,
* more tlb shootdown might be needed to fulfill the newer
* request. Conservatively keep the newer one.
*/
- if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen)) {
+ /*
+ * Good chance to shrink the batch using the old ugen.
+ */
+ if (dst_lb->ugen && arch_tlbbatch_diet(&dst_lb->batch.arch, dst_lb->ugen))
+ reset_batch(&dst_lb->batch);
dst_lb->ugen = src_ugen;
+ }
fold_batch(&dst_lb->batch, src_batch, false);
}
@@ -772,17 +778,45 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static unsigned long tlb_flush_start(void)
+{
+ /*
+ * Memory barrier implied in the atomic operation prevents
+ * reading luf_ugen from happening after the following
+ * tlb flush.
+ */
+ return new_luf_ugen();
+}
+
+static void tlb_flush_end(struct arch_tlbflush_unmap_batch *arch,
+ struct mm_struct *mm, unsigned long ugen)
+{
+ /*
+ * Prevent the following marking from being placed prior to the
+ * actual tlb flush.
+ */
+ smp_mb();
+
+ if (arch)
+ arch_tlbbatch_mark_ugen(arch, ugen);
+ if (mm)
+ arch_mm_mark_ugen(mm, ugen);
+}
+
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
+ unsigned long ugen;
if (!tlb_ubc_takeoff->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+ tlb_flush_end(&tlb_ubc_takeoff->arch, NULL, ugen);
/*
* Now that tlb shootdown of tlb_ubc_takeoff has been performed,
@@ -871,13 +905,17 @@ void try_to_unmap_flush(void)
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
+ unsigned long ugen;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc->arch);
+ tlb_flush_end(&tlb_ubc->arch, NULL, ugen);
+
reset_batch(tlb_ubc);
}
@@ -1009,7 +1047,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
if (pending != flushed) {
+ unsigned long ugen;
+
+ ugen = tlb_flush_start();
arch_flush_tlb_batched_pending(mm);
+ tlb_flush_end(NULL, mm, ugen);
/*
* If the new TLB flushing is pending during flushing, leave
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 18/26] mm/page_alloc: retry 3 times to take pcp pages on luf check failure
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (16 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 17/26] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 19/26] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
` (9 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/page_alloc.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3032fedd8392b..0b6e7f235c4a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3339,6 +3339,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
{
struct page *page;
+ /*
+ * Give up taking a page from the pcp list if it fails 3 times
+ * because the required luf tlb shootdown is not allowed.
+ */
+ int try_luf_pages = 3;
+
do {
if (list_empty(list)) {
int batch = nr_pcp_alloc(pcp, zone, order);
@@ -3353,11 +3359,21 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
return NULL;
}
- page = list_first_entry(list, struct page, pcp_list);
- if (!luf_takeoff_check_and_fold(page))
+ list_for_each_entry(page, list, pcp_list) {
+ if (luf_takeoff_check_and_fold(page)) {
+ list_del(&page->pcp_list);
+ pcp->count -= 1 << order;
+ break;
+ }
+ if (!--try_luf_pages)
+ return NULL;
+ }
+
+ /*
+ * If all the pages in the list fail the check...
+ */
+ if (list_entry_is_head(page, list, pcp_list))
return NULL;
- list_del(&page->pcp_list);
- pcp->count -= 1 << order;
} while (check_new_pages(page, order));
return page;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 19/26] mm: skip luf tlb flush for luf'd mm that already has been done
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (17 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 18/26] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 20/26] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
` (8 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
The fault handler performs the tlb flush pended by luf when a new pte
gains write permission, no matter whether the required tlb flush has
already been performed or not.
By storing the luf generation number, luf_ugen, in struct mm_struct, we
can skip the unnecessary tlb flush.
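A minimal userspace sketch of that skip, ignoring generation wraparound
for brevity; the struct and helper names are assumptions of this model,
not the kernel interface.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative per-mm luf bookkeeping. */
struct mm_model {
	unsigned long pending_ugen;	/* newest deferred unmap affecting this mm */
	unsigned long flushed_ugen;	/* newest generation already flushed */
};

/* Called e.g. when a pte of this mm gains write permission. */
static bool mm_luf_flush(struct mm_model *mm)
{
	if (mm->flushed_ugen >= mm->pending_ugen)
		return false;		/* nothing newer pending: skip the IPIs */

	/* ... the tlb shootdown for this mm would happen here ... */
	mm->flushed_ugen = mm->pending_ugen;
	return true;
}

int main(void)
{
	struct mm_model mm = { .pending_ugen = 5, .flushed_ugen = 5 };

	printf("flush #1: %d\n", mm_luf_flush(&mm));	/* 0: already flushed */
	mm.pending_ugen = 6;				/* a new luf'd unmap */
	printf("flush #2: %d\n", mm_luf_flush(&mm));	/* 1: flush performed */
	printf("flush #3: %d\n", mm_luf_flush(&mm));	/* 0: skipped again */
	return 0;
}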
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 2 +-
include/linux/mm_types.h | 9 +++++
kernel/fork.c | 1 +
kernel/sched/core.c | 2 +-
mm/memory.c | 22 ++++++++++--
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 74 +++++++++++++++++++++++++++++++++++++--
7 files changed, 104 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4a99351be111e..94b329a5127a7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -552,7 +552,7 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
/*
* Don't leave stale tlb entries for this vma.
*/
- luf_flush(0);
+ luf_flush_vma(vma);
if (tlb->fullmm)
return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b3eb5a4e45efb..8de4c190ad514 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -38,8 +38,10 @@ struct luf_batch {
unsigned long ugen;
rwlock_t lock;
};
+void luf_batch_init(struct luf_batch *lb);
#else
struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
#endif
/*
@@ -1022,6 +1024,9 @@ struct mm_struct {
* moving a PROT_NONE mapped page.
*/
atomic_t tlb_flush_pending;
+
+ /* luf batch for this mm */
+ struct luf_batch luf_batch;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
atomic_t tlb_flush_batched;
@@ -1272,8 +1277,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
void luf_flush(unsigned short luf_key);
+void luf_flush_mm(struct mm_struct *mm);
+void luf_flush_vma(struct vm_area_struct *vma);
#else
static inline void luf_flush(unsigned short luf_key) {}
+static inline void luf_flush_mm(struct mm_struct *mm) {}
+static inline void luf_flush_vma(struct vm_area_struct *vma) {}
#endif
struct vm_fault;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0061cf2450efd..593e74235ea8a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1268,6 +1268,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
spin_lock_init(&mm->arg_lock);
+ luf_batch_init(&mm->luf_batch);
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aea08d8a9e258..c7665cb93f617 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5225,7 +5225,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
- luf_flush(0);
+ luf_flush_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/memory.c b/mm/memory.c
index 0e85c49bc5028..b02f86b1adb91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6081,6 +6081,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ struct address_space *mapping = NULL;
bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6112,9 +6113,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* should be considered.
*/
if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
- flags & FAULT_FLAG_WRITE)
+ flags & FAULT_FLAG_WRITE) {
flush = true;
+ /*
+ * The !VM_SHARED case doesn't need handling here because it
+ * won't update pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+ }
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6149,8 +6158,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
/*
* Ensure to clean stale tlb entries for this vma.
*/
- if (flush)
- luf_flush(0);
+ if (flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+ }
return ret;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 215d8d93560fd..5a876c1c93a80 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -100,7 +100,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
else
- luf_flush(0);
+ luf_flush_vma(vma);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index cf6667fb18fe2..e0304dc74c3a7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -695,7 +695,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
-static void luf_batch_init(struct luf_batch *lb)
+void luf_batch_init(struct luf_batch *lb)
{
rwlock_init(&lb->lock);
reset_batch(&lb->batch);
@@ -778,6 +778,31 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static void fold_luf_batch_mm(struct luf_batch *dst,
+ struct mm_struct *mm)
+{
+ unsigned long flags;
+ bool need_fold = false;
+
+ read_lock_irqsave(&dst->lock, flags);
+ if (arch_tlbbatch_need_fold(&dst->batch.arch, mm))
+ need_fold = true;
+ read_unlock(&dst->lock);
+
+ write_lock(&dst->lock);
+ if (unlikely(need_fold))
+ arch_tlbbatch_add_pending(&dst->batch.arch, mm, 0);
+
+ /*
+ * dst->ugen represents a sort of request for tlb shootdown. The
+ * newer it is, the more tlb shootdown might be needed to
+ * fulfill the newer request. Keep the newest one so as not to
+ * miss any necessary tlb shootdown.
+ */
+ dst->ugen = new_luf_ugen();
+ write_unlock_irqrestore(&dst->lock, flags);
+}
+
static unsigned long tlb_flush_start(void)
{
/*
@@ -894,6 +919,49 @@ void luf_flush(unsigned short luf_key)
}
EXPORT_SYMBOL(luf_flush);
+void luf_flush_vma(struct vm_area_struct *vma)
+{
+ struct mm_struct *mm;
+ struct address_space *mapping = NULL;
+
+ if (!vma)
+ return;
+
+ mm = vma->vm_mm;
+ /*
+ * The !VM_SHARED case doesn't need handling here because it
+ * won't update pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+}
+
+void luf_flush_mm(struct mm_struct *mm)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mm)
+ return;
+
+ lb = &mm->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -962,8 +1030,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!can_luf_test())
tlb_ubc = ¤t->tlb_ubc;
- else
+ else {
tlb_ubc = ¤t->tlb_ubc_ro;
+ fold_luf_batch_mm(&mm->luf_batch, mm);
+ }
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 20/26] mm, fs: skip tlb flushes for luf'd filemap that already has been done
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (18 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 19/26] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 21/26] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
` (7 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
For a luf'd filemap, tlb shootdown is performed when updating the page
cache, no matter whether the required tlb flushes have already been done
or not.
By storing luf metadata in struct address_space and updating it
properly, we can skip the unnecessary tlb flush.
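The same idea scoped to a file: a minimal userspace sketch of why
per-mapping bookkeeping beats the global luf_flush(0). The names and
types are assumptions made only for this model.

#include <stdio.h>

/* Illustrative per-address_space luf bookkeeping. */
struct mapping_model {
	unsigned int pending_cpus;	/* cpus with stale entries for this file */
};

/* Unmap side: a read-only page of this file was luf'd while mapped on @cpu. */
static void mapping_luf_fold(struct mapping_model *m, int cpu)
{
	m->pending_cpus |= 1u << cpu;
}

/* Writer side (write_begin, truncate, ...): flush only what this file owes. */
static unsigned int mapping_luf_flush(struct mapping_model *m)
{
	unsigned int to_flush = m->pending_cpus;

	/* ... the tlb shootdown would target only to_flush ... */
	m->pending_cpus = 0;
	return to_flush;
}

int main(void)
{
	struct mapping_model file_a = { 0 }, file_b = { 0 };

	mapping_luf_fold(&file_a, 1);	/* file_a unmapped lazily on cpu1 */
	mapping_luf_fold(&file_b, 3);	/* file_b unmapped lazily on cpu3 */

	/* Updating file_a shoots down cpu1 only; file_b's cpu3 is untouched. */
	printf("flush for file_a: 0x%x\n", mapping_luf_flush(&file_a));	/* 0x2 */
	printf("flush for file_b: 0x%x\n", mapping_luf_flush(&file_b));	/* 0x8 */
	return 0;
}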
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
fs/inode.c | 1 +
include/linux/fs.h | 4 ++-
include/linux/mm_types.h | 2 ++
mm/memory.c | 4 +--
mm/rmap.c | 59 +++++++++++++++++++++++++---------------
mm/truncate.c | 14 +++++-----
mm/vmscan.c | 2 +-
7 files changed, 53 insertions(+), 33 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 46fbd5b234822..e155e51be2d28 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -404,6 +404,7 @@ static void __address_space_init_once(struct address_space *mapping)
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->i_private_list);
spin_lock_init(&mapping->i_private_lock);
+ luf_batch_init(&mapping->luf_batch);
mapping->i_mmap = RB_ROOT_CACHED;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec88270221bfe..0cc588c704cd1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -461,6 +461,7 @@ extern const struct address_space_operations empty_aops;
* @i_private_lock: For use by the owner of the address_space.
* @i_private_list: For use by the owner of the address_space.
* @i_private_data: For use by the owner of the address_space.
+ * @luf_batch: Data to track need of tlb flush by luf.
*/
struct address_space {
struct inode *host;
@@ -482,6 +483,7 @@ struct address_space {
struct list_head i_private_list;
struct rw_semaphore i_mmap_rwsem;
void * i_private_data;
+ struct luf_batch luf_batch;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
@@ -508,7 +510,7 @@ static inline int mapping_write_begin(struct file *file,
* Ensure to clean stale tlb entries for this mapping.
*/
if (!ret)
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8de4c190ad514..c50cfc1c6282f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1279,10 +1279,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
void luf_flush(unsigned short luf_key);
void luf_flush_mm(struct mm_struct *mm);
void luf_flush_vma(struct vm_area_struct *vma);
+void luf_flush_mapping(struct address_space *mapping);
#else
static inline void luf_flush(unsigned short luf_key) {}
static inline void luf_flush_mm(struct mm_struct *mm) {}
static inline void luf_flush_vma(struct vm_area_struct *vma) {}
+static inline void luf_flush_mapping(struct address_space *mapping) {}
#endif
struct vm_fault;
diff --git a/mm/memory.c b/mm/memory.c
index b02f86b1adb91..c98af5e567e89 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6161,10 +6161,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flush) {
/*
* If it has a VM_SHARED mapping, all the mms involved
- * should be luf_flush'ed.
+ * in the struct address_space should be luf_flush'ed.
*/
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index e0304dc74c3a7..0cb13e8fcd739 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -691,7 +691,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
/*
- * Use 0th entry as accumulated batch.
+ * XXX: Reserve the 0th entry for later use.
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
@@ -936,7 +936,7 @@ void luf_flush_vma(struct vm_area_struct *vma)
mapping = vma->vm_file->f_mapping;
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
@@ -962,6 +962,29 @@ void luf_flush_mm(struct mm_struct *mm)
try_to_unmap_flush();
}
+void luf_flush_mapping(struct address_space *mapping)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mapping)
+ return;
+
+ lb = &mapping->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush_mapping);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -1010,7 +1033,8 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -1032,27 +1056,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
tlb_ubc = ¤t->tlb_ubc;
else {
tlb_ubc = ¤t->tlb_ubc_ro;
+
fold_luf_batch_mm(&mm->luf_batch, mm);
+ if (mapping)
+ fold_luf_batch_mm(&mapping->luf_batch, mm);
}
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
- if (can_luf_test()) {
- struct luf_batch *lb;
- unsigned long flags;
-
- /*
- * Accumulate to the 0th entry right away so that
- * luf_flush(0) can be uesed to properly perform pending
- * TLB flush once this unmapping is observed.
- */
- lb = &luf_batch[0];
- write_lock_irqsave(&lb->lock, flags);
- __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
- write_unlock_irqrestore(&lb->lock, flags);
- }
-
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -1134,7 +1146,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
}
@@ -1503,7 +1516,7 @@ int folio_mkclean(struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return cleaned;
}
@@ -2037,6 +2050,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2174,7 +2188,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma, mapping);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2414,6 +2428,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2563,7 +2578,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma, mapping);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 14618c53f1910..f9a3416610231 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -128,7 +128,7 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -170,7 +170,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return 0;
}
@@ -220,7 +220,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
if (!folio_test_large(folio))
return true;
@@ -282,7 +282,7 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
@@ -417,7 +417,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -537,7 +537,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return count;
}
@@ -704,7 +704,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffc4a48710f1d..cbca027d2a10e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -836,7 +836,7 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 21/26] mm: perform luf tlb shootdown per zone in batched manner
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (19 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 20/26] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 22/26] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
` (6 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Each luf page in buddy has its pending tlb shootdown information and
performs the corresponding tlb shootdown on exit from buddy. However,
every exit from buddy causes small but frequent IPIs. Even though the
total number of IPIs gets reduced, unnecessary waits on conflicting CPUs
in the IPI handler have been observed via perf profiling.
Thus, make it perform luf tlb shootdown per zone in a batched manner
when pages exit from buddy, so as to avoid frequent IPIs.
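As an illustration (not part of the patch), here is a minimal user-space
sketch of the core trick: page->zone_ugen keeps only the low 16 bits of
the zone-wide generation counter, sharing storage with luf_key, and the
page_zone_ugen() helper added below widens it back by picking the
biggest candidate that is not ahead of the zone's real counter. The
helper name widen_zone_ugen() and the body of ugen_before() are
assumptions made for the sketch, the latter being the usual wrap-safe
signed comparison.

#include <stdio.h>
#include <limits.h>
#include <stdbool.h>

/* assumed wrap-safe "a comes before b" comparison */
static bool ugen_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* mirrors the reconstruction done by page_zone_ugen() in this patch */
static unsigned long widen_zone_ugen(unsigned long zone_ugen,
				     unsigned short short_zone_ugen)
{
	unsigned long cand1, cand2;

	if (!short_zone_ugen)
		return 0;	/* zero means "no pending shootdown" */

	/* candidate in the same 64K window as the current zone counter */
	cand1 = (zone_ugen & ~(unsigned long)USHRT_MAX) | short_zone_ugen;
	/* candidate one 64K window earlier */
	cand2 = cand1 - USHRT_MAX - 1;

	/* return the biggest candidate not ahead of the real zone_ugen */
	if (!ugen_before(zone_ugen, cand1))
		return cand1;
	return cand2;
}

int main(void)
{
	/* tagged just before the counter crossed a 64K boundary */
	printf("%#lx\n", widen_zone_ugen(0x20003UL, 0xfffe));	/* 0x1fffe */
	/* tagged within the current 64K window */
	printf("%#lx\n", widen_zone_ugen(0x20003UL, 0x0002));	/* 0x20002 */
	return 0;
}

Keeping only an unsigned short per page is what lets zone_ugen share the
same struct page field as luf_key, while the cheap comparison against
zone->zone_ugen_done decides whether a page's shootdown is already done.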
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 44 ++++-
include/linux/mm_types.h | 19 +-
include/linux/mmzone.h | 9 +
include/linux/sched.h | 2 +
mm/compaction.c | 10 +-
mm/internal.h | 13 +-
mm/mm_init.c | 5 +
mm/page_alloc.c | 363 +++++++++++++++++++++++++++++++--------
mm/page_reporting.c | 9 +-
mm/rmap.c | 6 +-
10 files changed, 383 insertions(+), 97 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 53a5f1cb21e0d..46638e86e8073 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4161,12 +4161,16 @@ static inline int do_mseal(unsigned long start, size_t len_in, unsigned long fla
}
#endif
-#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
/*
* luf_ugen will start with 2 so that 1 can be regarded as a passed one.
*/
#define LUF_UGEN_INIT 2
+/*
+ * zone_ugen will start with 2 so that 1 can be regarded as done.
+ */
+#define ZONE_UGEN_INIT 2
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
static inline bool ugen_before(unsigned long a, unsigned long b)
{
/*
@@ -4177,7 +4181,11 @@ static inline bool ugen_before(unsigned long a, unsigned long b)
static inline unsigned long next_ugen(unsigned long ugen)
{
- if (ugen + 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen + 1))
return ugen + 1;
/*
* Avoid invalid ugen, zero.
@@ -4187,7 +4195,11 @@ static inline unsigned long next_ugen(unsigned long ugen)
static inline unsigned long prev_ugen(unsigned long ugen)
{
- if (ugen - 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen - 1))
return ugen - 1;
/*
* Avoid invalid ugen, zero.
@@ -4195,4 +4207,30 @@ static inline unsigned long prev_ugen(unsigned long ugen)
return ugen - 2;
}
#endif
+
+/*
+ * return the biggest ugen but it should be before the real zone_ugen.
+ */
+static inline unsigned long page_zone_ugen(struct zone *zone, struct page *page)
+{
+ unsigned long zone_ugen = zone->zone_ugen;
+ unsigned short short_zone_ugen = page->zone_ugen;
+ unsigned long cand1, cand2;
+
+ if (!short_zone_ugen)
+ return 0;
+
+ cand1 = (zone_ugen & ~(unsigned long)USHRT_MAX) | short_zone_ugen;
+ cand2 = cand1 - USHRT_MAX - 1;
+
+ if (!ugen_before(zone_ugen, cand1))
+ return cand1;
+
+ return cand2;
+}
+
+static inline void set_page_zone_ugen(struct page *page, unsigned short zone_ugen)
+{
+ page->zone_ugen = zone_ugen;
+}
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c50cfc1c6282f..e3132e1e5e5d2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -132,11 +132,20 @@ struct page {
*/
unsigned short order;
- /*
- * For tracking need of tlb flush,
- * by luf(lazy unmap flush).
- */
- unsigned short luf_key;
+ union {
+ /*
+ * For tracking need of
+ * tlb flush, by
+ * luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+
+ /*
+ * Casted zone_ugen with
+ * unsigned short.
+ */
+ unsigned short zone_ugen;
+ };
};
};
};
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac3178b5fc50b..3c1b04d21fda9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -117,6 +117,7 @@ extern int page_group_by_mobility_disabled;
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
struct list_head pend_list[MIGRATE_TYPES];
+ unsigned long pend_zone_ugen[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -998,6 +999,14 @@ struct zone {
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
/* Count pages that need tlb shootdown on allocation */
atomic_long_t nr_luf_pages;
+ /* Generation number for that tlb shootdown has been done */
+ unsigned long zone_ugen_done;
+ /* Generation number to control zone batched tlb shootdown */
+ unsigned long zone_ugen;
+ /* Approximate latest luf_ugen that have ever entered */
+ unsigned long luf_ugen;
+ /* Accumulated tlb batch for this zone */
+ struct tlbflush_unmap_batch zone_batch;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c6c4fd021973..463cb2fb8f919 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1378,6 +1378,8 @@ struct task_struct {
int luf_no_shootdown;
int luf_takeoff_started;
unsigned long luf_ugen;
+ unsigned long zone_ugen;
+ unsigned long wait_zone_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/compaction.c b/mm/compaction.c
index 27f3d743762bb..a7f17867decae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -653,7 +653,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
- if (!luf_takeoff_check(page))
+ if (!luf_takeoff_check(cc->zone, page))
goto isolate_fail;
/* Found a free page, will break it into order-0 pages */
@@ -689,7 +689,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/*
* Be careful to not go outside of the pageblock.
@@ -1611,7 +1611,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
- if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
+ if (unlikely(consider_pend && !luf_takeoff_check(cc->zone, freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1679,7 +1679,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
@@ -2415,7 +2415,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
*/
luf_takeoff_start();
ret = __compact_finished(cc);
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
diff --git a/mm/internal.h b/mm/internal.h
index 77657c17af204..e634eaf220f00 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1595,10 +1595,10 @@ static inline void accept_page(struct page *page)
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
bool luf_takeoff_start(void);
-void luf_takeoff_end(void);
+void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
-bool luf_takeoff_check(struct page *page);
-bool luf_takeoff_check_and_fold(struct page *page);
+bool luf_takeoff_check(struct zone *zone, struct page *page);
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page);
static inline bool non_luf_pages_ok(struct zone *zone)
{
@@ -1608,7 +1608,6 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-
unsigned short fold_unmap_luf(void);
/*
@@ -1696,10 +1695,10 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
-static inline void luf_takeoff_end(void) {}
+static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
-static inline bool luf_takeoff_check(struct page *page) { return true; }
-static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
static inline unsigned short fold_unmap_luf(void) { return 0; }
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 12b96cd6a87b0..58e616ceef52a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1397,6 +1397,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
+ zone->free_area[order].pend_zone_ugen[t] = ZONE_UGEN_INIT;
zone->free_area[order].nr_free = 0;
}
@@ -1404,6 +1405,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
atomic_long_set(&zone->nr_luf_pages, 0);
+ zone->zone_ugen_done = ZONE_UGEN_INIT - 1;
+ zone->zone_ugen = ZONE_UGEN_INIT;
+ zone->luf_ugen = LUF_UGEN_INIT - 1;
+ reset_batch(&zone->zone_batch);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b6e7f235c4a1..b81931c6f2cfd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -663,16 +663,29 @@ bool luf_takeoff_start(void)
return !no_shootdown;
}
+static void wait_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ while (ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ cond_resched();
+}
+
+static void set_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ WRITE_ONCE(zone->zone_ugen_done, zone_ugen);
+}
+
/*
* Should be called within the same context of luf_takeoff_start().
*/
-void luf_takeoff_end(void)
+void luf_takeoff_end(struct zone *zone)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
unsigned long cur_luf_ugen;
+ unsigned long cur_zone_ugen;
+ unsigned long cur_wait_zone_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -700,6 +713,8 @@ void luf_takeoff_end(void)
goto out;
cur_luf_ugen = current->luf_ugen;
+ cur_zone_ugen = current->zone_ugen;
+ cur_wait_zone_ugen = current->wait_zone_ugen;
current->luf_ugen = 0;
@@ -707,10 +722,38 @@ void luf_takeoff_end(void)
reset_batch(tlb_ubc_takeoff);
try_to_unmap_flush_takeoff();
+
+ if (cur_wait_zone_ugen || cur_zone_ugen) {
+ /*
+ * pcp(zone == NULL) doesn't work with zone batch.
+ */
+ if (zone) {
+ current->zone_ugen = 0;
+ current->wait_zone_ugen = 0;
+
+ /*
+ * Guarantee that tlb shootdown required for the
+ * zone_ugen has been completed once observing
+ * 'zone_ugen_done'.
+ */
+ smp_mb();
+
+ /*
+ * zone->zone_ugen_done should be updated
+ * sequentially.
+ */
+ if (cur_wait_zone_ugen)
+ wait_zone_ugen_done(zone, cur_wait_zone_ugen);
+ if (cur_zone_ugen)
+ set_zone_ugen_done(zone, cur_zone_ugen);
+ }
+ }
out:
if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
VM_WARN_ON(current->luf_ugen);
+ VM_WARN_ON(current->zone_ugen);
+ VM_WARN_ON(current->wait_zone_ugen);
}
}
@@ -741,9 +784,9 @@ bool luf_takeoff_no_shootdown(void)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check(struct page *page)
+bool luf_takeoff_check(struct zone *zone, struct page *page)
{
- unsigned short luf_key = page_luf_key(page);
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -753,7 +796,29 @@ bool luf_takeoff_check(struct page *page)
return false;
}
- if (!luf_key)
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ return true;
+ }
+
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone->zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
return true;
return !current->luf_no_shootdown;
@@ -763,13 +828,11 @@ bool luf_takeoff_check(struct page *page)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check_and_fold(struct page *page)
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
- unsigned short luf_key = page_luf_key(page);
- struct luf_batch *lb;
- unsigned long lb_ugen;
unsigned long flags;
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -779,28 +842,94 @@ bool luf_takeoff_check_and_fold(struct page *page)
return false;
}
- if (!luf_key)
- return true;
+ /*
+ * pcp case
+ */
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
- lb = &luf_batch[luf_key];
- read_lock_irqsave(&lb->lock, flags);
- lb_ugen = lb->ugen;
+ if (!luf_key)
+ return true;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
- if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
- if (current->luf_no_shootdown) {
- read_unlock_irqrestore(&lb->lock, flags);
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone->zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ return true;
+
+ if (current->luf_no_shootdown)
return false;
- }
- fold_batch(tlb_ubc_takeoff, &lb->batch, false);
- read_unlock_irqrestore(&lb->lock, flags);
+ /*
+ * zone batched flush has been already set.
+ */
+ if (current->zone_ugen)
+ return true;
+
+ /*
+ * Others are already performing tlb shootdown for us. All we
+ * need is to wait for those to complete.
+ */
+ if (zone_ugen != zone->zone_ugen) {
+ if (!current->wait_zone_ugen ||
+ ugen_before(current->wait_zone_ugen, zone_ugen))
+ current->wait_zone_ugen = zone_ugen;
+ /*
+ * It's the first time that zone->zone_ugen has been set to
+ * current->zone_ugen. current->luf_ugen also get set.
+ */
+ } else {
+ current->wait_zone_ugen = prev_ugen(zone->zone_ugen);
+ current->zone_ugen = zone->zone_ugen;
+ current->luf_ugen = zone->luf_ugen;
+
+ /*
+ * Now that tlb shootdown for the zone_ugen will be
+ * performed at luf_takeoff_end(), advance it so that
+ * the next zone->lock holder can efficiently avoid
+ * unnecessary tlb shootdown.
+ */
+ zone->zone_ugen = next_ugen(zone->zone_ugen);
- if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
- current->luf_ugen = lb_ugen;
+ /*
+ * All the luf pages will eventually become non-luf
+ * pages by tlb flushing at luf_takeoff_end() and,
+ * flush_pend_list_if_done() will empty pend_list.
+ */
+ atomic_long_set(&zone->nr_luf_pages, 0);
+ fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
+ }
return true;
}
#endif
@@ -822,6 +951,42 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
zone->nr_free_highatomic + nr_pages);
}
+static void flush_pend_list_if_done(struct zone *zone,
+ struct free_area *area, int migratetype)
+{
+ unsigned long zone_ugen_done = READ_ONCE(zone->zone_ugen_done);
+
+ /*
+ * tlb shootdown required for the zone_ugen already has been
+ * done. Thus, let's move pages in pend_list to free_list to
+ * secure more non-luf pages.
+ */
+ if (!ugen_before(zone_ugen_done, area->pend_zone_ugen[migratetype]))
+ list_splice_init(&area->pend_list[migratetype],
+ &area->free_list[migratetype]);
+}
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+/*
+ * Should be called with zone->lock held and irq disabled.
+ */
+static void update_zone_batch(struct zone *zone, unsigned short luf_key)
+{
+ unsigned long lb_ugen;
+ struct luf_batch *lb = &luf_batch[luf_key];
+
+ read_lock(&lb->lock);
+ fold_batch(&zone->zone_batch, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock(&lb->lock);
+
+ if (ugen_before(zone->luf_ugen, lb_ugen))
+ zone->luf_ugen = lb_ugen;
+}
+#else
+static void update_zone_batch(struct zone *zone, unsigned short luf_key) {}
+#endif
+
/* Used for pages not on another list */
static inline void __add_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype,
@@ -830,6 +995,12 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
struct free_area *area = &zone->free_area[order];
struct list_head *list;
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
@@ -839,8 +1010,9 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page)) {
+ if (page_zone_ugen(zone, page)) {
list = &area->pend_list[migratetype];
+ area->pend_zone_ugen[migratetype] = zone->zone_ugen;
atomic_long_add(1 << order, &zone->nr_luf_pages);
} else
list = &area->free_list[migratetype];
@@ -862,6 +1034,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
+ unsigned long zone_ugen = page_zone_ugen(zone, page);
/* Free page moving can fail, so it happens before the type update */
VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
@@ -878,9 +1051,12 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page))
+ if (zone_ugen) {
list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
- else
+ if (!area->pend_zone_ugen[new_mt] ||
+ ugen_before(area->pend_zone_ugen[new_mt], zone_ugen))
+ area->pend_zone_ugen[new_mt] = zone_ugen;
+ } else
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
@@ -898,7 +1074,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
- if (page_luf_key(page))
+ if (page_zone_ugen(zone, page))
atomic_long_sub(1 << order, &zone->nr_luf_pages);
list_del(&page->buddy_list);
@@ -936,29 +1112,39 @@ static inline struct page *get_page_from_free_area(struct zone *zone,
*/
pend_first = !non_luf_pages_ok(zone);
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
if (pend_first) {
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
} else {
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
}
return NULL;
@@ -1023,6 +1209,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_pfn;
struct page *buddy;
bool to_tail;
+ unsigned long zone_ugen;
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1034,20 +1221,25 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
/*
- * Use the page's luf_key unchanged if luf_key == 0. Worth
- * noting that page_luf_key() will be 0 in most cases since it's
- * initialized at free_pages_prepare().
+ * Use the page's zone_ugen unchanged if luf_key == 0. Worth
+ * noting that page_zone_ugen() will be 0 in most cases since
+ * it's initialized at free_pages_prepare().
+ *
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
*/
- if (luf_key)
- set_page_luf_key(page, luf_key);
- else
- luf_key = page_luf_key(page);
+ if (luf_key) {
+ zone_ugen = zone->zone_ugen;
+ set_page_zone_ugen(page, (unsigned short)zone_ugen);
+ update_zone_batch(zone, luf_key);
+ } else
+ zone_ugen = page_zone_ugen(zone, page);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
- unsigned short buddy_luf_key;
+ unsigned long buddy_zone_ugen;
- if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
+ if (!zone_ugen && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -1080,17 +1272,15 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ buddy_zone_ugen = page_zone_ugen(zone, buddy);
+
/*
- * !buddy_luf_key && !luf_key : do nothing
- * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
- * !buddy_luf_key && luf_key : do nothing
- * buddy_luf_key && luf_key : merge two into luf_key
+ * if (!zone_ugen && !buddy_zone_ugen) : nothing to do
+ * if ( zone_ugen && !buddy_zone_ugen) : nothing to do
*/
- buddy_luf_key = page_luf_key(buddy);
- if (buddy_luf_key && !luf_key)
- luf_key = buddy_luf_key;
- else if (buddy_luf_key && luf_key)
- fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+ if ((!zone_ugen && buddy_zone_ugen) ||
+ ( zone_ugen && buddy_zone_ugen && ugen_before(zone_ugen, buddy_zone_ugen)))
+ zone_ugen = buddy_zone_ugen;
if (unlikely(buddy_mt != migratetype)) {
/*
@@ -1103,7 +1293,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
- set_page_luf_key(page, luf_key);
+ set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
}
@@ -1446,6 +1636,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
do {
unsigned long pfn;
int mt;
+ unsigned short luf_key;
page = list_last_entry(list, struct page, pcp_list);
pfn = page_to_pfn(page);
@@ -1456,7 +1647,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+ /*
+ * page private in pcp stores luf_key while it
+ * stores zone_ugen in buddy. Thus, the private
+ * needs to be cleared and the luf_key needs to
+ * be passed to buddy.
+ */
+ luf_key = page_luf_key(page);
+ set_page_private(page, 0);
+
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, luf_key);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
@@ -1499,7 +1699,15 @@ static void free_one_page(struct zone *zone, struct page *page,
* valid luf_key can be passed only if order == 0.
*/
VM_WARN_ON(luf_key && order);
- set_page_luf_key(page, luf_key);
+
+ /*
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
+ */
+ if (luf_key) {
+ set_page_zone_ugen(page, (unsigned short)zone->zone_ugen);
+ update_zone_batch(zone, luf_key);
+ }
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1659,7 +1867,7 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- if (page_luf_key(&page[size]))
+ if (page_zone_ugen(zone, &page[size]))
tail = true;
__add_to_free_list(&page[size], zone, high, migratetype, tail);
@@ -1677,7 +1885,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
@@ -2199,7 +2407,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
@@ -2438,12 +2646,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
}
return false;
@@ -2644,12 +2852,15 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
+
+ /*
+ * Reset all the luf fields. tlb shootdown will be
+ * performed at luf_takeoff_end() below if needed.
+ */
+ set_page_private(page, 0);
}
spin_unlock_irqrestore(&zone->lock, flags);
- /*
- * Check and flush before using the pages taken off.
- */
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return i;
}
@@ -3163,7 +3374,7 @@ int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
}
del_page_from_free_list(page, zone, order, mt);
- if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
/*
@@ -3262,7 +3473,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return NULL;
}
}
@@ -3270,7 +3481,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3360,7 +3571,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
list_for_each_entry(page, list, pcp_list) {
- if (luf_takeoff_check_and_fold(page)) {
+ if (luf_takeoff_check_and_fold(NULL, page)) {
list_del(&page->pcp_list);
pcp->count -= 1 << order;
break;
@@ -3395,7 +3606,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
return NULL;
}
@@ -3412,7 +3623,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -3451,6 +3662,7 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype);
out:
+
/* Separate test+clear to avoid unnecessary atomics */
if ((alloc_flags & ALLOC_KSWAPD) &&
unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
@@ -5059,7 +5271,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -5069,7 +5281,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -7235,7 +7447,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
pfn += (1 << order);
}
@@ -7243,7 +7455,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return end_pfn - start_pfn - already_offline;
}
@@ -7305,7 +7517,7 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- if (page_luf_key(current_buddy))
+ if (page_zone_ugen(zone, current_buddy))
tail = true;
add_to_free_list(current_buddy, zone, high, migratetype, tail);
@@ -7337,7 +7549,7 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page_head)))
VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
@@ -7353,7 +7565,7 @@ bool take_page_off_buddy(struct page *page)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
@@ -7372,6 +7584,13 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
+
+ /*
+ * Reset all the luf fields. tlb shootdown has already
+ * been performed by take_page_off_buddy().
+ */
+ set_page_private(page, 0);
+
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e152b22fbba8a..b23d3ed34ec07 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -118,7 +118,8 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
/*
* Ensure private is zero before putting into the
- * allocator.
+ * allocator. tlb shootdown has already been performed
+ * at isolation.
*/
set_page_private(page, 0);
@@ -194,7 +195,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
- if (unlikely(consider_pend && !luf_takeoff_check(page))) {
+ if (unlikely(consider_pend && !luf_takeoff_check(zone, page))) {
VM_WARN_ON(1);
continue;
}
@@ -238,7 +239,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -283,7 +284,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return err;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 0cb13e8fcd739..ebe91ff1bcb16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -650,7 +650,11 @@ static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
- if (!ugen)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if (!(unsigned short)ugen)
ugen = atomic_long_inc_return(&luf_ugen);
return ugen;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 22/26] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok()
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (20 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 21/26] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 23/26] mm: separate move/undo parts from migrate_pages_batch() Byungchul Park
` (5 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Do not perform tlb shootdown if the context runs with preemption
disabled and there are already enough non-luf pages, so as not to hurt
preemptibility.
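For readers, a hedged stand-alone sketch of the policy that the reworked
no_shootdown_context() below implements; the boolean parameters stand in
for the kernel predicates preemptible(), in_task(), irqs_disabled() and
non_luf_pages_ok(), so this is an illustration of the decision, not the
kernel implementation.

#include <stdio.h>
#include <stdbool.h>

/* illustrative stand-in for the reworked no_shootdown_context() */
static bool no_shootdown_context(bool have_zone, bool non_luf_pages_ok,
				 bool preemptible, bool in_task,
				 bool irqs_disabled)
{
	/*
	 * While the zone still has enough non-luf pages, only allow a
	 * shootdown from a fully preemptible task context so that a
	 * preempt-disabled section never has to wait on IPIs.
	 */
	if (have_zone && non_luf_pages_ok)
		return !(preemptible && in_task);

	/*
	 * Under memory pressure, or for pcp where no zone is passed,
	 * fall back to the previous, looser rule.
	 */
	return !(!irqs_disabled && in_task);
}

int main(void)
{
	/* preempt-disabled task, zone still healthy: skip the shootdown */
	printf("%d\n", no_shootdown_context(true, true, false, true, false));	/* 1 */
	/* same context, but non-luf pages are running out: allow it */
	printf("%d\n", no_shootdown_context(true, false, false, true, false));	/* 0 */
	return 0;
}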
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 6 +++---
mm/internal.h | 5 +++--
mm/page_alloc.c | 27 +++++++++++++++------------
mm/page_isolation.c | 2 +-
mm/page_reporting.c | 4 ++--
5 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index a7f17867decae..8fa9de6db2441 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -605,7 +605,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -1601,7 +1601,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(cc->zone);
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
retry:
@@ -2413,7 +2413,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
* luf_takeoff_{start,end}() is required to identify whether
* this compaction context is tlb shootdownable for luf'd pages.
*/
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
ret = __compact_finished(cc);
luf_takeoff_end(cc->zone);
diff --git a/mm/internal.h b/mm/internal.h
index e634eaf220f00..fba19c283ac48 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1594,7 +1594,7 @@ static inline void accept_page(struct page *page)
#endif /* CONFIG_UNACCEPTED_MEMORY */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
-bool luf_takeoff_start(void);
+bool luf_takeoff_start(struct zone *zone);
void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct zone *zone, struct page *page);
@@ -1608,6 +1608,7 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
+
unsigned short fold_unmap_luf(void);
/*
@@ -1694,7 +1695,7 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
return true;
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
-static inline bool luf_takeoff_start(void) { return false; }
+static inline bool luf_takeoff_start(struct zone *zone) { return false; }
static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b81931c6f2cfd..ccbe49b78190a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -623,22 +623,25 @@ compaction_capture(struct capture_control *capc, struct page *page,
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
-static bool no_shootdown_context(void)
+static bool no_shootdown_context(struct zone *zone)
{
/*
- * If it performs with irq disabled, that might cause a deadlock.
- * Avoid tlb shootdown in this case.
+ * Tries to avoid tlb shootdown if !preemptible(). However, it
+ * should be allowed under heavy memory pressure.
*/
+ if (zone && non_luf_pages_ok(zone))
+ return !(preemptible() && in_task());
+
return !(!irqs_disabled() && in_task());
}
/*
* Can be called with zone lock released and irq enabled.
*/
-bool luf_takeoff_start(void)
+bool luf_takeoff_start(struct zone *zone)
{
unsigned long flags;
- bool no_shootdown = no_shootdown_context();
+ bool no_shootdown = no_shootdown_context(zone);
local_irq_save(flags);
@@ -2588,7 +2591,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* luf_takeoff_{start,end}() is required for
* get_page_from_free_area() to use luf_takeoff_check().
*/
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2829,7 +2832,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -3455,7 +3458,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3600,7 +3603,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -5229,7 +5232,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -7418,7 +7421,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -7536,7 +7539,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index eae33d188762b..ccd36838f9cff 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -211,7 +211,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index b23d3ed34ec07..83b66e7f0d257 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -170,7 +170,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (free_area_empty(area, mt))
return err;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(zone);
spin_lock_irq(&zone->lock);
/*
@@ -250,7 +250,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 23/26] mm: separate move/undo parts from migrate_pages_batch()
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (21 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 22/26] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 24/26] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
` (4 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which requires separate folio lists for its own handling during
migration. Refactor migrate_pages_batch() so as to split its move/undo
parts out into dedicated helpers.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/migrate.c | 134 +++++++++++++++++++++++++++++++--------------------
1 file changed, 83 insertions(+), 51 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index dfb5eba3c5223..5e12023dbc75a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1695,6 +1695,81 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
return nr_failed;
}
+static void migrate_folios_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ bool is_thp;
+ int nr_pages;
+ int rc;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_folio, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch (rc) {
+ case -EAGAIN:
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
+static void migrate_folios_undo(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ struct list_head *ret_folios)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
/*
* migrate_pages_batch() first unmaps folios in the from list as many as
* possible, then move the unmapped folios.
@@ -1717,7 +1792,7 @@ static int migrate_pages_batch(struct list_head *from,
int pass = 0;
bool is_thp = false;
bool is_large = false;
- struct folio *folio, *folio2, *dst = NULL, *dst2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
@@ -1888,42 +1963,11 @@ static int migrate_pages_batch(struct list_head *from,
thp_retry = 0;
nr_retry_pages = 0;
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
- nr_pages = folio_nr_pages(folio);
-
- cond_resched();
-
- rc = migrate_folio_move(put_new_folio, private,
- folio, dst, mode,
- reason, ret_folios);
- /*
- * The rules are:
- * Success: folio will be freed
- * -EAGAIN: stay on the unmap_folios list
- * Other errno: put on ret_folios list
- */
- switch(rc) {
- case -EAGAIN:
- retry++;
- thp_retry += is_thp;
- nr_retry_pages += nr_pages;
- break;
- case MIGRATEPAGE_SUCCESS:
- stats->nr_succeeded += nr_pages;
- stats->nr_thp_succeeded += is_thp;
- break;
- default:
- nr_failed++;
- stats->nr_thp_failed += is_thp;
- stats->nr_failed_pages += nr_pages;
- break;
- }
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ /* Move the unmapped folios */
+ migrate_folios_move(&unmap_folios, &dst_folios,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1932,20 +1976,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
/* Cleanup remaining folios */
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
- anon_vma, true, ret_folios);
- list_del(&dst->lru);
- migrate_folio_undo_dst(dst, true, put_new_folio, private);
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ migrate_folios_undo(&unmap_folios, &dst_folios,
+ put_new_folio, private, ret_folios);
return rc;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 24/26] mm/migrate: apply luf mechanism to unmapping during migration
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (22 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 23/26] mm: separate move/undo parts from migrate_pages_batch() Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 25/26] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
` (3 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
A new mechanism, LUF (Lazy Unmap Flush), defers the tlb flush until
folios that have been unmapped and freed eventually get allocated again.
It's safe for folios that had been mapped read-only and were unmapped,
since the contents of the folios don't change while staying in pcp or
buddy, so we can still read the data through the stale tlb entries.
Apply the mechanism to unmapping during migration.
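To make the control flow easier to follow, here is a toy user-space
sketch of the list split introduced in migrate_pages_batch(): folios
whose unmapping turned out to be entirely read-only (can_luf) are kept
on a separate list and later moved with a luf_key so their tlb flush
stays deferred. All types and helpers here are illustrative stand-ins,
not the kernel's.

#include <stdio.h>
#include <stdbool.h>

struct folio {
	int id;
	bool mapped_read_only;	/* what try_to_migrate() now reports */
};

/* stand-in for try_to_migrate(): true when all mappings were read-only */
static bool unmap_folio(struct folio *f)
{
	return f->mapped_read_only;
}

static void move_folios(struct folio **list, int n, unsigned short luf_key)
{
	for (int i = 0; i < n; i++)
		printf("move folio %d with luf_key %u\n",
		       list[i]->id, (unsigned int)luf_key);
}

int main(void)
{
	struct folio folios[] = { {1, true}, {2, false}, {3, true} };
	struct folio *unmap[3], *unmap_luf[3];
	int n = 0, n_luf = 0;

	for (int i = 0; i < 3; i++) {
		if (unmap_folio(&folios[i]))
			unmap_luf[n_luf++] = &folios[i];	/* flush deferrable */
		else
			unmap[n++] = &folios[i];		/* flush required now */
	}

	/* non-luf folios are moved after the usual tlb flush, key 0 */
	move_folios(unmap, n, 0);
	/* luf folios carry a key, standing in for fold_unmap_luf() */
	move_folios(unmap_luf, n_luf, 42);
	return 0;
}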
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 2 ++
include/linux/rmap.h | 2 +-
mm/migrate.c | 65 ++++++++++++++++++++++++++++++++++----------
mm/rmap.c | 15 ++++++----
mm/swap.c | 2 +-
5 files changed, 63 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46638e86e8073..5c81c9831bc5d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1476,6 +1476,8 @@ static inline void folio_put(struct folio *folio)
__folio_put(folio);
}
+void page_cache_release(struct folio *folio);
+
/**
* folio_put_refs - Reduce the reference count on a folio.
* @folio: The folio.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a04088f3f2..cedba4812ccc7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -660,7 +660,7 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
-void try_to_migrate(struct folio *folio, enum ttu_flags flags);
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
void try_to_unmap(struct folio *, enum ttu_flags flags);
int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
diff --git a/mm/migrate.c b/mm/migrate.c
index 5e12023dbc75a..6b77efee4ebd7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1172,7 +1172,8 @@ static void migrate_folio_undo_dst(struct folio *dst, bool locked,
/* Cleanup src folio upon migration success */
static void migrate_folio_done(struct folio *src,
- enum migrate_reason reason)
+ enum migrate_reason reason,
+ unsigned short luf_key)
{
/*
* Compaction can migrate also non-LRU pages which are
@@ -1183,16 +1184,30 @@ static void migrate_folio_done(struct folio *src,
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
- if (reason != MR_MEMORY_FAILURE)
- /* We release the page in page_handle_poison. */
+ /* We release the page in page_handle_poison. */
+ if (reason == MR_MEMORY_FAILURE)
+ luf_flush(luf_key);
+ else if (!luf_key)
folio_put(src);
+ else {
+ /*
+ * Should be the last reference.
+ */
+ if (unlikely(!folio_put_testzero(src)))
+ VM_WARN_ON(1);
+
+ page_cache_release(src);
+ mem_cgroup_uncharge(src);
+ free_unref_page(&src->page, folio_order(src), luf_key);
+ }
}
/* Obtain the lock on page, remove all ptes. */
static int migrate_folio_unmap(new_folio_t get_new_folio,
free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio **dstp, enum migrate_mode mode,
- enum migrate_reason reason, struct list_head *ret)
+ enum migrate_reason reason, struct list_head *ret,
+ bool *can_luf)
{
struct folio *dst;
int rc = -EAGAIN;
@@ -1208,7 +1223,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
folio_clear_unevictable(src);
/* free_pages_prepare() will clear PG_isolated. */
list_del(&src->lru);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, 0);
return MIGRATEPAGE_SUCCESS;
}
@@ -1325,7 +1340,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
/* Establish migration ptes */
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
- try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
+ *can_luf = try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
old_page_state |= PAGE_WAS_MAPPED;
}
@@ -1353,7 +1368,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio *dst,
enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
+ struct list_head *ret, unsigned short luf_key)
{
int rc;
int old_page_state = 0;
@@ -1407,7 +1422,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
if (anon_vma)
put_anon_vma(anon_vma);
folio_unlock(src);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, luf_key);
return rc;
out:
@@ -1702,7 +1717,7 @@ static void migrate_folios_move(struct list_head *src_folios,
struct list_head *ret_folios,
struct migrate_pages_stats *stats,
int *retry, int *thp_retry, int *nr_failed,
- int *nr_retry_pages)
+ int *nr_retry_pages, unsigned short luf_key)
{
struct folio *folio, *folio2, *dst, *dst2;
bool is_thp;
@@ -1719,7 +1734,7 @@ static void migrate_folios_move(struct list_head *src_folios,
rc = migrate_folio_move(put_new_folio, private,
folio, dst, mode,
- reason, ret_folios);
+ reason, ret_folios, luf_key);
/*
* The rules are:
* Success: folio will be freed
@@ -1796,7 +1811,11 @@ static int migrate_pages_batch(struct list_head *from,
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
+ LIST_HEAD(unmap_folios_luf);
+ LIST_HEAD(dst_folios_luf);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ unsigned short luf_key;
+ bool can_luf;
VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));
@@ -1871,9 +1890,11 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ can_luf = false;
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
- ret_folios);
+ ret_folios, &can_luf);
+
/*
* The rules are:
* Success: folio will be freed
@@ -1919,7 +1940,8 @@ static int migrate_pages_batch(struct list_head *from,
/* nr_failed isn't updated for not used */
stats->nr_thp_failed += thp_retry;
rc_saved = rc;
- if (list_empty(&unmap_folios))
+ if (list_empty(&unmap_folios) &&
+ list_empty(&unmap_folios_luf))
goto out;
else
goto move;
@@ -1933,8 +1955,13 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_succeeded += is_thp;
break;
case MIGRATEPAGE_UNMAP:
- list_move_tail(&folio->lru, &unmap_folios);
- list_add_tail(&dst->lru, &dst_folios);
+ if (can_luf) {
+ list_move_tail(&folio->lru, &unmap_folios_luf);
+ list_add_tail(&dst->lru, &dst_folios_luf);
+ } else {
+ list_move_tail(&folio->lru, &unmap_folios);
+ list_add_tail(&dst->lru, &dst_folios);
+ }
break;
default:
/*
@@ -1954,6 +1981,8 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /* Should be before try_to_unmap_flush() */
+ luf_key = fold_unmap_luf();
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();
@@ -1967,7 +1996,11 @@ static int migrate_pages_batch(struct list_head *from,
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
- &nr_failed, &nr_retry_pages);
+ &nr_failed, &nr_retry_pages, 0);
+ migrate_folios_move(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages, luf_key);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1978,6 +2011,8 @@ static int migrate_pages_batch(struct list_head *from,
/* Cleanup remaining folios */
migrate_folios_undo(&unmap_folios, &dst_folios,
put_new_folio, private, ret_folios);
+ migrate_folios_undo(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, ret_folios);
return rc;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index ebe91ff1bcb16..b6b61b8103655 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2750,8 +2750,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*
* Tries to remove all the page table entries which are mapping this folio and
* replace them with special swap entries. Caller must hold the folio lock.
+ * Return true if all the mappings are read-only, otherwise false.
*/
-void try_to_migrate(struct folio *folio, enum ttu_flags flags)
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_migrate_one,
@@ -2769,11 +2770,11 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
*/
if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
TTU_SYNC | TTU_BATCH_FLUSH)))
- return;
+ return false;
if (folio_is_zone_device(folio) &&
(!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
- return;
+ return false;
/*
* During exec, a temporary VMA is setup and later moved.
@@ -2793,10 +2794,12 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/swap.c b/mm/swap.c
index 54b0ba10dbb86..d6c29fdc67ca5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -84,7 +84,7 @@ static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
* This path almost never happens for VM activity - pages are normally freed
* in batches. But it gets used by networking - and for compound pages.
*/
-static void page_cache_release(struct folio *folio)
+void page_cache_release(struct folio *folio)
{
struct lruvec *lruvec = NULL;
unsigned long flags;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 25/26] mm/vmscan: apply luf mechanism to unmapping during folio reclaim
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (23 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 24/26] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 5:20 ` [RFC PATCH v12 26/26] mm/luf: implement luf debug feature Byungchul Park
` (2 subsequent siblings)
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed, eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
Applied the mechanism to unmapping during folio reclaim.
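For illustration only, the following is a minimal standalone C model of the
idea above, not the kernel code in the diff below. The names folio_model,
try_to_unmap_model() and fold_unmap_luf_model() are made up for this sketch;
it only shows how the boolean returned by the unmap step routes a freed folio
into either the regular batch (flushed immediately) or the luf batch (freed
with a luf_key so its flush can be deferred):

/*
 * Standalone sketch (userspace, not kernel code): folios whose mappings
 * were all read-only go to a separate "luf" batch that is freed with a
 * luf_key, so their TLB flush can be deferred until re-allocation.
 */
#include <stdbool.h>
#include <stdio.h>

#define BATCH_MAX 4

struct folio_model { int id; bool all_ro; };

struct batch_model {
	struct folio_model *slot[BATCH_MAX];
	int nr;
};

/* stand-in for try_to_unmap(): true if every mapping was read-only */
static bool try_to_unmap_model(struct folio_model *f)
{
	return f->all_ro;
}

/* stand-in for fold_unmap_luf(): hands out a key for the pending batch */
static unsigned short fold_unmap_luf_model(void)
{
	return 42;
}

static void finalize_model(struct batch_model *plain, struct batch_model *luf)
{
	/* the key must be taken before the flush, as in the real patch */
	unsigned short luf_key = fold_unmap_luf_model();

	printf("flush TLBs now, free %d plain folios with key 0\n", plain->nr);
	printf("free %d luf folios with key %u, their flush is deferred\n",
	       luf->nr, (unsigned int)luf_key);
	plain->nr = 0;
	luf->nr = 0;
}

int main(void)
{
	struct folio_model folios[] = { { 1, true }, { 2, false }, { 3, true } };
	struct batch_model plain = { .nr = 0 }, luf = { .nr = 0 };
	int i;

	for (i = 0; i < 3; i++) {
		bool can_luf = try_to_unmap_model(&folios[i]);
		struct batch_model *b = can_luf ? &luf : &plain;

		b->slot[b->nr++] = &folios[i];
		if (b->nr == BATCH_MAX)
			finalize_model(&plain, &luf);
	}
	finalize_model(&plain, &luf);	/* finalize the remaining folios */
	return 0;
}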
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/rmap.h | 5 +++--
mm/rmap.c | 11 +++++++----
mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
3 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cedba4812ccc7..854b41441d466 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -661,7 +661,7 @@ int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
-void try_to_unmap(struct folio *, enum ttu_flags flags);
+bool try_to_unmap(struct folio *, enum ttu_flags flags);
int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct page **pages,
@@ -794,8 +794,9 @@ static inline int folio_referenced(struct folio *folio, int is_locked,
return 0;
}
-static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+static inline bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
+ return false;
}
static inline int folio_mkclean(struct folio *folio)
diff --git a/mm/rmap.c b/mm/rmap.c
index b6b61b8103655..55003eb0b4936 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2386,10 +2386,11 @@ static int folio_not_mapped(struct folio *folio)
* Tries to remove all the page table entries which are mapping this
* folio. It is the caller's responsibility to check if the folio is
* still mapped if needed (use TTU_SYNC to prevent accounting races).
+ * Return true if all the mappings are read-only, otherwise false.
*
* Context: Caller must hold the folio lock.
*/
-void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
@@ -2408,10 +2409,12 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cbca027d2a10e..1ece0ccfccefb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1052,14 +1052,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct reclaim_stat *stat, bool ignore_references)
{
struct folio_batch free_folios;
+ struct folio_batch free_folios_luf;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
struct swap_iocb *plug = NULL;
+ unsigned short luf_key;
folio_batch_init(&free_folios);
+ folio_batch_init(&free_folios_luf);
memset(stat, 0, sizeof(*stat));
cond_resched();
do_demote_pass = can_demote(pgdat->node_id, sc);
@@ -1071,6 +1074,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool can_luf = false;
cond_resched();
@@ -1309,7 +1313,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_test_large(folio))
flags |= TTU_SYNC;
- try_to_unmap(folio, flags);
+ can_luf = try_to_unmap(folio, flags);
if (folio_mapped(folio)) {
stat->nr_unmap_fail += nr_pages;
if (!was_swapbacked &&
@@ -1453,6 +1457,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* leave it off the LRU).
*/
nr_reclaimed += nr_pages;
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
continue;
}
}
@@ -1485,6 +1491,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
nr_reclaimed += nr_pages;
folio_unqueue_deferred_split(folio);
+
+ if (can_luf) {
+ if (folio_batch_add(&free_folios_luf, folio) == 0) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+ }
+ continue;
+ }
+
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1519,9 +1538,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
}
/* 'folio_list' is always empty here */
+ /*
+ * Finalize this turn before demote_folio_list().
+ */
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+
/* Migrate folios selected for demotion */
stat->nr_demoted = demote_folio_list(&demote_folios, pgdat);
nr_reclaimed += stat->nr_demoted;
@@ -1554,10 +1585,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
- mem_cgroup_uncharge_folios(&free_folios);
- try_to_unmap_flush();
- free_unref_folios(&free_folios, 0);
-
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 26/26] mm/luf: implement luf debug feature
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (24 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 25/26] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
@ 2025-02-20 5:20 ` Byungchul Park
2025-02-20 10:32 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Hillf Danton
2025-02-20 15:15 ` Dave Hansen
27 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 5:20 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
We need a luf debug feature to detect when luf goes wrong by any chance.
As an RFC, this suggests a simple implementation that reports problematic
situations caused by luf.
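For illustration only, here is a tiny standalone C model of the debug idea,
not the kernel implementation below. A per-page record (loosely modeled after
the page_ext-attached struct luf_batch) is marked when a page is freed with a
pending deferred flush, and a check at write-enabling or mapping sites reports
if the flush is still outstanding. The names lufd_rec, lufd_mark_model() and
lufd_check_model() are made up for this sketch:

#include <stdbool.h>
#include <stdio.h>

/* per-page debug record, modeled after the page_ext-attached data */
struct lufd_rec {
	unsigned short luf_key;	/* pending luf key, 0 if none */
	bool flush_done;	/* has the deferred flush completed? */
};

/* mark: page freed while its TLB flush is deferred under @luf_key */
static void lufd_mark_model(struct lufd_rec *r, unsigned short luf_key)
{
	r->luf_key = luf_key;
	r->flush_done = false;
}

/* check: called from sites like mkwrite() or kmap() in the real patch */
static void lufd_check_model(const struct lufd_rec *r, const char *site)
{
	if (r->luf_key && !r->flush_done)
		fprintf(stderr, "LUFD: key %u still pending at %s\n",
			(unsigned int)r->luf_key, site);
}

int main(void)
{
	struct lufd_rec rec = { 0 };

	lufd_mark_model(&rec, 7);	   /* freed, flush deferred under key 7 */
	lufd_check_model(&rec, "kmap");	   /* reports: flush still outstanding */
	rec.flush_done = true;		   /* shootdown completed for key 7 */
	lufd_check_model(&rec, "mkwrite"); /* silent: nothing pending */
	return 0;
}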
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 3 +
arch/riscv/mm/tlbflush.c | 35 ++++-
arch/x86/include/asm/pgtable.h | 10 ++
arch/x86/include/asm/tlbflush.h | 3 +
arch/x86/mm/pgtable.c | 10 ++
arch/x86/mm/tlb.c | 35 ++++-
include/linux/highmem-internal.h | 5 +
include/linux/mm.h | 20 ++-
include/linux/mm_types.h | 16 +--
include/linux/mm_types_task.h | 16 +++
include/linux/sched.h | 5 +
mm/highmem.c | 1 +
mm/memory.c | 12 ++
mm/page_alloc.c | 34 ++++-
mm/page_ext.c | 3 +
mm/rmap.c | 229 ++++++++++++++++++++++++++++++
16 files changed, 418 insertions(+), 19 deletions(-)
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index ec5caeb3cf8ef..9451f3d22f229 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -69,6 +69,9 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned
bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 93afb7a299003..de91bfe0426c2 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -216,6 +216,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -223,17 +242,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -243,10 +267,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -254,6 +280,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -262,10 +289,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -283,15 +312,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -309,4 +341,5 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 593f10aabd45a..414bcabb23b51 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -695,12 +695,22 @@ static inline pud_t pud_mkyoung(pud_t pud)
return pud_set_flags(pud, _PAGE_ACCESSED);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud);
+static inline pud_t __pud_mkwrite(pud_t pud)
+{
+ pud = pud_set_flags(pud, _PAGE_RW);
+
+ return pud_clear_saveddirty(pud);
+}
+#else
static inline pud_t pud_mkwrite(pud_t pud)
{
pud = pud_set_flags(pud, _PAGE_RW);
return pud_clear_saveddirty(pud);
}
+#endif
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline int pte_soft_dirty(pte_t pte)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1fc5bacd72dff..2825f4befb272 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -297,6 +297,9 @@ extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, un
extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..f72e4cfdb0a8d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -901,6 +901,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
+ lufd_check_pages(pte_page(pte), 0);
if (vma->vm_flags & VM_SHADOW_STACK)
return pte_mkwrite_shstk(pte);
@@ -911,6 +912,7 @@ pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
+ lufd_check_pages(pmd_page(pmd), PMD_ORDER);
if (vma->vm_flags & VM_SHADOW_STACK)
return pmd_mkwrite_shstk(pmd);
@@ -919,6 +921,14 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd_clear_saveddirty(pmd);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud)
+{
+ lufd_check_pages(pud_page(pud), PUD_ORDER);
+ return __pud_mkwrite(pud);
+}
+#endif
+
void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
{
/*
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 975f58fa4b30f..e9ae0d8f73442 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1253,6 +1253,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -1260,17 +1279,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -1280,10 +1304,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -1291,6 +1317,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -1299,10 +1326,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1320,15 +1349,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1346,6 +1378,7 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index dd100e849f5e0..0792530d1be7b 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -41,6 +41,7 @@ static inline void *kmap(struct page *page)
{
void *addr;
+ lufd_check_pages(page, 0);
might_sleep();
if (!PageHighMem(page))
addr = page_address(page);
@@ -161,6 +162,7 @@ static inline struct page *kmap_to_page(void *addr)
static inline void *kmap(struct page *page)
{
+ lufd_check_pages(page, 0);
might_sleep();
return page_address(page);
}
@@ -177,11 +179,13 @@ static inline void kunmap(struct page *page)
static inline void *kmap_local_page(struct page *page)
{
+ lufd_check_pages(page, 0);
return page_address(page);
}
static inline void *kmap_local_folio(struct folio *folio, size_t offset)
{
+ lufd_check_folio(folio);
return page_address(&folio->page) + offset;
}
@@ -204,6 +208,7 @@ static inline void __kunmap_local(const void *addr)
static inline void *kmap_atomic(struct page *page)
{
+ lufd_check_pages(page, 0);
if (IS_ENABLED(CONFIG_PREEMPT_RT))
migrate_disable();
else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c81c9831bc5d..9572fbbb9d73f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -44,6 +44,24 @@ extern int sysctl_page_lock_unfairness;
void mm_core_init(void);
void init_mm_internals(void);
+#ifdef CONFIG_LUF_DEBUG
+void lufd_check_folio(struct folio *f);
+void lufd_check_pages(const struct page *p, unsigned int order);
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order);
+void lufd_check_queued_pages(void);
+void lufd_queue_page_for_check(struct page *page, int order);
+void lufd_mark_folio(struct folio *f, unsigned short luf_key);
+void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key);
+#else
+static inline void lufd_check_folio(struct folio *f) {}
+static inline void lufd_check_pages(const struct page *p, unsigned int order) {}
+static inline void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order) {}
+static inline void lufd_check_queued_pages(void) {}
+static inline void lufd_queue_page_for_check(struct page *page, int order) {}
+static inline void lufd_mark_folio(struct folio *f, unsigned short luf_key) {}
+static inline void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key) {}
+#endif
+
#ifndef CONFIG_NUMA /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -113,7 +131,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
#endif
#ifndef page_to_virt
-#define page_to_virt(x) __va(PFN_PHYS(page_to_pfn(x)))
+#define page_to_virt(x) ({ lufd_check_pages(x, 0); __va(PFN_PHYS(page_to_pfn(x)));})
#endif
#ifndef lm_alias
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e3132e1e5e5d2..e0c5712dc46ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,10 @@
#include <asm/mmu.h>
+#ifdef CONFIG_LUF_DEBUG
+extern struct page_ext_operations luf_debug_ops;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -32,18 +36,6 @@
struct address_space;
struct mem_cgroup;
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-struct luf_batch {
- struct tlbflush_unmap_batch batch;
- unsigned long ugen;
- rwlock_t lock;
-};
-void luf_batch_init(struct luf_batch *lb);
-#else
-struct luf_batch {};
-static inline void luf_batch_init(struct luf_batch *lb) {}
-#endif
-
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index bff5706b76e14..b5dfc451c009b 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -9,6 +9,7 @@
*/
#include <linux/types.h>
+#include <linux/spinlock_types.h>
#include <asm/page.h>
@@ -67,4 +68,19 @@ struct tlbflush_unmap_batch {
#endif
};
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+void luf_batch_init(struct luf_batch *lb);
+#else
+struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
+#endif
+
+#if defined(CONFIG_LUF_DEBUG)
+#define NR_LUFD_PAGES 512
+#endif
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 463cb2fb8f919..eb1487fa101e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1380,6 +1380,11 @@ struct task_struct {
unsigned long luf_ugen;
unsigned long zone_ugen;
unsigned long wait_zone_ugen;
+#if defined(CONFIG_LUF_DEBUG)
+ struct page *lufd_pages[NR_LUFD_PAGES];
+ int lufd_pages_order[NR_LUFD_PAGES];
+ int lufd_pages_nr;
+#endif
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/highmem.c b/mm/highmem.c
index ef3189b36cadb..a323d5a655bf9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -576,6 +576,7 @@ void *__kmap_local_page_prot(struct page *page, pgprot_t prot)
{
void *kmap;
+ lufd_check_pages(page, 0);
/*
* To broaden the usage of the actual kmap_local() machinery always map
* pages when debugging is enabled and the architecture has no problems
diff --git a/mm/memory.c b/mm/memory.c
index c98af5e567e89..89d047867d60d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6124,6 +6124,18 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
mapping = vma->vm_file->f_mapping;
}
+#ifdef CONFIG_LUF_DEBUG
+ if (luf_flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * in the struct address_space should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush_mapping(mapping);
+ luf_flush_mm(mm);
+ }
+#endif
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ccbe49b78190a..c8ab60c60bb08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -758,6 +758,8 @@ void luf_takeoff_end(struct zone *zone)
VM_WARN_ON(current->zone_ugen);
VM_WARN_ON(current->wait_zone_ugen);
}
+
+ lufd_check_queued_pages();
}
/*
@@ -853,8 +855,10 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
struct luf_batch *lb;
unsigned long lb_ugen;
- if (!luf_key)
+ if (!luf_key) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
@@ -875,12 +879,15 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
current->luf_ugen = lb_ugen;
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
zone_ugen = page_zone_ugen(zone, page);
- if (!zone_ugen)
+ if (!zone_ugen) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
/*
* Should not be zero since zone-zone_ugen has been updated in
@@ -888,17 +895,23 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
*/
VM_WARN_ON(!zone->zone_ugen);
- if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen)) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
if (current->luf_no_shootdown)
return false;
+ lufd_check_zone_pages(zone, page, buddy_order(page));
+
/*
* zone batched flush has been already set.
*/
- if (current->zone_ugen)
+ if (current->zone_ugen) {
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
+ }
/*
* Others are already performing tlb shootdown for us. All we
@@ -933,6 +946,7 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
atomic_long_set(&zone->nr_luf_pages, 0);
fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
}
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
#endif
@@ -1238,6 +1252,11 @@ static inline void __free_one_page(struct page *page,
} else
zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
unsigned long buddy_zone_ugen;
@@ -1299,6 +1318,10 @@ static inline void __free_one_page(struct page *page,
set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
}
done_merging:
@@ -3201,6 +3224,8 @@ void free_unref_page(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ lufd_mark_pages(page, order, luf_key);
+
if (!pcp_allowed_order(order)) {
__free_pages_ok(page, order, FPI_NONE, luf_key);
return;
@@ -3253,6 +3278,7 @@ void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
+ lufd_mark_folio(folio, luf_key);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c1..be40bc2a93378 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -89,6 +89,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
+#ifdef CONFIG_LUF_DEBUG
+ &luf_debug_ops,
+#endif
};
unsigned long page_ext_size;
diff --git a/mm/rmap.c b/mm/rmap.c
index 55003eb0b4936..fd6d5cb0fa8d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1161,6 +1161,235 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+#ifdef CONFIG_LUF_DEBUG
+
+static bool need_luf_debug(void)
+{
+ return true;
+}
+
+static void init_luf_debug(void)
+{
+ /* Do nothing */
+}
+
+struct page_ext_operations luf_debug_ops = {
+ .size = sizeof(struct luf_batch),
+ .need = need_luf_debug,
+ .init = init_luf_debug,
+ .need_shared_flags = false,
+};
+
+static bool __lufd_check_zone_pages(struct page *page, int nr,
+ struct tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_done(&lb->batch.arch, &batch->arch);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret || ugen_before(ugen, lb_ugen))
+ return false;
+ }
+ return true;
+}
+
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page || !zone)
+ return;
+
+ warn = !__lufd_check_zone_pages(page, 1 << order,
+ &zone->zone_batch, zone->luf_ugen);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+
+static bool __lufd_check_pages(const struct page *page, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_diet(&lb->batch.arch, lb_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret)
+ return false;
+ }
+ return true;
+}
+
+void lufd_queue_page_for_check(struct page *page, int order)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+
+ if (!page)
+ return;
+
+ if (current->lufd_pages_nr >= NR_LUFD_PAGES) {
+ VM_WARN_ONCE(1, "LUFD: NR_LUFD_PAGES is too small.\n");
+ return;
+ }
+
+ *(parray + current->lufd_pages_nr) = page;
+ *(oarray + current->lufd_pages_nr) = order;
+ current->lufd_pages_nr++;
+}
+
+void lufd_check_queued_pages(void)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+ int i;
+
+ for (i = 0; i < current->lufd_pages_nr; i++)
+ lufd_check_pages(*(parray + i), *(oarray + i));
+ current->lufd_pages_nr = 0;
+}
+
+void lufd_check_folio(struct folio *folio)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!folio)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_folio);
+
+void lufd_check_pages(const struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_pages);
+
+static void __lufd_mark_pages(struct page *page, int nr, unsigned short luf_key)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ fold_luf_batch(lb, &luf_batch[luf_key]);
+ page_ext_put(page_ext);
+ }
+}
+
+void lufd_mark_folio(struct folio *folio, unsigned short luf_key)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+ __lufd_mark_pages(page, nr, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+
+void lufd_mark_pages(struct page *page, unsigned int order, unsigned short luf_key)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+ __lufd_mark_pages(page, 1 << order, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+#endif
+
/**
* page_address_in_vma - The virtual address of a page in this VMA.
* @folio: The folio containing the page.
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (25 preceding siblings ...)
2025-02-20 5:20 ` [RFC PATCH v12 26/26] mm/luf: implement luf debug feature Byungchul Park
@ 2025-02-20 10:32 ` Hillf Danton
2025-02-20 10:51 ` Byungchul Park
2025-02-20 11:09 ` Byungchul Park
2025-02-20 15:15 ` Dave Hansen
27 siblings, 2 replies; 102+ messages in thread
From: Hillf Danton @ 2025-02-20 10:32 UTC (permalink / raw)
To: Byungchul Park; +Cc: linux-kernel, linux-mm
On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> To check luf's stability, I ran a heavy LLM inference workload consuming
> 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> stable enough.
>
> I'm posting the latest version so that anyone can try luf mechanism if
> wanted by any chance. However, I tagged RFC again because there are
> still issues that should be resolved to merge to mainline:
>
> 1. Even though system wide total cpu time for TLB shootdown is
> reduced over 95%, page allocation paths should take additional cpu
> time shifted from page reclaim to perform TLB shootdown.
>
> 2. We need luf debug feature to detect when luf goes wrong by any
> chance. I implemented just a draft version that checks the sanity
> on mkwrite(), kmap(), and so on. I need to gather better ideas
> to improve the debug feature.
>
> ---
>
> Hi everyone,
>
> While I'm working with a tiered memory system e.g. CXL memory, I have
> been facing migration overhead esp. tlb shootdown on promotion or
> demotion between different tiers. Yeah.. most tlb shootdowns on
> migration through hinting fault can be avoided thanks to Huang Ying's
> work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> is inaccessible").
>
> However, it's only for migration through hinting fault. I thought it'd
> be much better if we have a general mechanism to reduce all the tlb
> numbers that we can apply to any unmap code, that we normally believe
> tlb flush should be followed.
>
> I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> flush until folios that have been unmapped and freed, eventually get
> allocated again. It's safe for folios that had been mapped read-only
> and were unmapped, as long as the contents of the folios don't change
> while staying in pcp or buddy so we can still read the data through the
> stale tlb entries.
>
Given pcp or buddy, you are opening window for use after free which makes
no sense in 99% cases.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 10:32 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Hillf Danton
@ 2025-02-20 10:51 ` Byungchul Park
2025-02-20 11:09 ` Byungchul Park
1 sibling, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 10:51 UTC (permalink / raw)
To: Hillf Danton; +Cc: linux-kernel, linux-mm, kernel_team
On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> > To check luf's stability, I ran a heavy LLM inference workload consuming
> > 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> > stable enough.
> >
> > I'm posting the latest version so that anyone can try luf mechanism if
> > wanted by any chance. However, I tagged RFC again because there are
> > still issues that should be resolved to merge to mainline:
> >
> > 1. Even though system wide total cpu time for TLB shootdown is
> > reduced over 95%, page allocation paths should take additional cpu
> > time shifted from page reclaim to perform TLB shootdown.
> >
> > 2. We need luf debug feature to detect when luf goes wrong by any
> > chance. I implemented just a draft version that checks the sanity
> > on mkwrite(), kmap(), and so on. I need to gather better ideas
> > to improve the debug feature.
> >
> > ---
> >
> > Hi everyone,
> >
> > While I'm working with a tiered memory system e.g. CXL memory, I have
> > been facing migration overhead esp. tlb shootdown on promotion or
> > demotion between different tiers. Yeah.. most tlb shootdowns on
> > migration through hinting fault can be avoided thanks to Huang Ying's
> > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> > is inaccessible").
> >
> > However, it's only for migration through hinting fault. I thought it'd
> > be much better if we have a general mechanism to reduce all the tlb
> > numbers that we can apply to any unmap code, that we normally believe
> > tlb flush should be followed.
> >
> > I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> > flush until folios that have been unmapped and freed, eventually get
> > allocated again. It's safe for folios that had been mapped read-only
> > and were unmapped, as long as the contents of the folios don't change
> > while staying in pcp or buddy so we can still read the data through the
> > stale tlb entries.
> >
> Given pcp or buddy, you are opening window for use after free which makes
> no sense in 99% cases.
It's kinda 'use (= read-only) after free', but luf ensures the data of
the pages in question doesn't change. That's the premise luf relies on.
Byungchul
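A tiny standalone C model of the scenario being argued here (illustrative
only; not from the patch set): a stale translation keeps pointing at a freed
page, but because the page was only ever mapped read-only and its contents
are not modified while it sits in pcp or buddy, a read through the stale
entry still returns the old, correct data, and the flush happens before the
page can be handed out and written again. The names freed_page,
stale_tlb_entry and flushed_before_realloc are made up for this sketch:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* one "physical page" sitting in the free lists after a lazy unmap */
static char freed_page[8] = "DATA";

/* a stale TLB entry kept by a CPU that skipped the flush */
static const char *stale_tlb_entry = freed_page;

/* LUF guarantee modeled as a flag: the flush precedes re-allocation */
static bool flushed_before_realloc;

static void reallocate_and_write(void)
{
	assert(flushed_before_realloc);	/* LUF flushes first */
	freed_page[0] = 'X';		/* only now may the contents change */
}

int main(void)
{
	/* read-only use after free: still returns the old, unchanged data */
	printf("stale read sees: %s\n", stale_tlb_entry);

	flushed_before_realloc = true;	/* fold_unmap_luf()/luf_flush() point */
	stale_tlb_entry = NULL;		/* the stale entry is gone now */
	reallocate_and_write();
	return 0;
}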
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 10:32 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Hillf Danton
2025-02-20 10:51 ` Byungchul Park
@ 2025-02-20 11:09 ` Byungchul Park
2025-02-20 11:49 ` Hillf Danton
1 sibling, 1 reply; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 11:09 UTC (permalink / raw)
To: Hillf Danton; +Cc: linux-kernel, linux-mm, kernel_team
On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> > To check luf's stability, I ran a heavy LLM inference workload consuming
> > 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> > stable enough.
> >
> > I'm posting the latest version so that anyone can try luf mechanism if
> > wanted by any chance. However, I tagged RFC again because there are
> > still issues that should be resolved to merge to mainline:
> >
> > 1. Even though system wide total cpu time for TLB shootdown is
> > reduced over 95%, page allocation paths should take additional cpu
> > time shifted from page reclaim to perform TLB shootdown.
> >
> > 2. We need luf debug feature to detect when luf goes wrong by any
> > chance. I implemented just a draft version that checks the sanity
> > on mkwrite(), kmap(), and so on. I need to gather better ideas
> > to improve the debug feature.
> >
> > ---
> >
> > Hi everyone,
> >
> > While I'm working with a tiered memory system e.g. CXL memory, I have
> > been facing migration overhead esp. tlb shootdown on promotion or
> > demotion between different tiers. Yeah.. most tlb shootdowns on
> > migration through hinting fault can be avoided thanks to Huang Ying's
> > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> > is inaccessible").
> >
> > However, it's only for migration through hinting fault. I thought it'd
> > be much better if we have a general mechanism to reduce all the tlb
> > numbers that we can apply to any unmap code, that we normally believe
> > tlb flush should be followed.
> >
> > I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> > flush until folios that have been unmapped and freed, eventually get
> > allocated again. It's safe for folios that had been mapped read-only
> > and were unmapped, as long as the contents of the folios don't change
> > while staying in pcp or buddy so we can still read the data through the
> > stale tlb entries.
> >
> Given pcp or buddy, you are opening window for use after free which makes
> no sense in 99% cases.
Just in case that I don't understand what you meant and for better
understanding, can you provide a simple and problematic example from
the u-a-f?
Byungchul
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 11:09 ` Byungchul Park
@ 2025-02-20 11:49 ` Hillf Danton
2025-02-20 12:20 ` Byungchul Park
` (4 more replies)
0 siblings, 5 replies; 102+ messages in thread
From: Hillf Danton @ 2025-02-20 11:49 UTC (permalink / raw)
To: Byungchul Park; +Cc: linux-kernel, linux-mm, kernel_team
On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> > On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> > > To check luf's stability, I ran a heavy LLM inference workload consuming
> > > 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> > > stable enough.
> > >
> > > I'm posting the latest version so that anyone can try luf mechanism if
> > > wanted by any chance. However, I tagged RFC again because there are
> > > still issues that should be resolved to merge to mainline:
> > >
> > > 1. Even though system wide total cpu time for TLB shootdown is
> > > reduced over 95%, page allocation paths should take additional cpu
> > > time shifted from page reclaim to perform TLB shootdown.
> > >
> > > 2. We need luf debug feature to detect when luf goes wrong by any
> > > chance. I implemented just a draft version that checks the sanity
> > > on mkwrite(), kmap(), and so on. I need to gather better ideas
> > > to improve the debug feature.
> > >
> > > ---
> > >
> > > Hi everyone,
> > >
> > > While I'm working with a tiered memory system e.g. CXL memory, I have
> > > been facing migration overhead esp. tlb shootdown on promotion or
> > > demotion between different tiers. Yeah.. most tlb shootdowns on
> > > migration through hinting fault can be avoided thanks to Huang Ying's
> > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> > > is inaccessible").
> > >
> > > However, it's only for migration through hinting fault. I thought it'd
> > > be much better if we have a general mechanism to reduce all the tlb
> > > numbers that we can apply to any unmap code, that we normally believe
> > > tlb flush should be followed.
> > >
> > > I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> > > flush until folios that have been unmapped and freed, eventually get
> > > allocated again. It's safe for folios that had been mapped read-only
> > > and were unmapped, as long as the contents of the folios don't change
> > > while staying in pcp or buddy so we can still read the data through the
> > > stale tlb entries.
> > >
> > Given pcp or buddy, you are opening window for use after free which makes
> > no sense in 99% cases.
>
> Just in case that I don't understand what you meant and for better
> understanding, can you provide a simple and problematic example from
> the u-a-f?
>
Tell us if it is illegal to commit rape without pregnancy in your home town?
PS defering flushing tlb [1,2] is no go.
Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
[1] https://lore.kernel.org/lkml/20250127155146.GB25757@willie-the-truck/
[2] https://lore.kernel.org/lkml/xhsmhwmdwihte.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 11:49 ` Hillf Danton
@ 2025-02-20 12:20 ` Byungchul Park
2025-02-20 12:40 ` Byungchul Park
` (3 subsequent siblings)
4 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 12:20 UTC (permalink / raw)
To: Hillf Danton; +Cc: linux-kernel, linux-mm, kernel_team
On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> > > > To check luf's stability, I ran a heavy LLM inference workload consuming
> > > > 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> > > > stable enough.
> > > >
> > > > I'm posting the latest version so that anyone can try luf mechanism if
> > > > wanted by any chance. However, I tagged RFC again because there are
> > > > still issues that should be resolved to merge to mainline:
> > > >
> > > > 1. Even though system wide total cpu time for TLB shootdown is
> > > > reduced over 95%, page allocation paths should take additional cpu
> > > > time shifted from page reclaim to perform TLB shootdown.
> > > >
> > > > 2. We need luf debug feature to detect when luf goes wrong by any
> > > > chance. I implemented just a draft version that checks the sanity
> > > > on mkwrite(), kmap(), and so on. I need to gather better ideas
> > > > to improve the debug feature.
> > > >
> > > > ---
> > > >
> > > > Hi everyone,
> > > >
> > > > While I'm working with a tiered memory system e.g. CXL memory, I have
> > > > been facing migration overhead esp. tlb shootdown on promotion or
> > > > demotion between different tiers. Yeah.. most tlb shootdowns on
> > > > migration through hinting fault can be avoided thanks to Huang Ying's
> > > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> > > > is inaccessible").
> > > >
> > > > However, it's only for migration through hinting fault. I thought it'd
> > > > be much better if we have a general mechanism to reduce all the tlb
> > > > numbers that we can apply to any unmap code, that we normally believe
> > > > tlb flush should be followed.
> > > >
> > > > I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> > > > flush until folios that have been unmapped and freed, eventually get
> > > > allocated again. It's safe for folios that had been mapped read-only
> > > > and were unmapped, as long as the contents of the folios don't change
> > > > while staying in pcp or buddy so we can still read the data through the
> > > > stale tlb entries.
> > > >
> > > Given pcp or buddy, you are opening window for use after free which makes
> > > no sense in 99% cases.
> >
> > Just in case that I don't understand what you meant and for better
> > understanding, can you provide a simple and problematic example from
> > the u-a-f?
> >
> Tell us if it is illegal to commit rape without pregnancy in your home town?
Memory overcommit also looked like cheating to someone like you. You
definitely think it'd be total nonsense that each task believes it
can use its own full virtual space.
We say uaf is illegal only when it can cause access to the freed area
without *appropriate permission*.
> PS defering flushing tlb [1,2] is no go.
I will check this shortly.
Byungchul
>
> Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
> [1] https://lore.kernel.org/lkml/20250127155146.GB25757@willie-the-truck/
> [2] https://lore.kernel.org/lkml/xhsmhwmdwihte.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 11:49 ` Hillf Danton
2025-02-20 12:20 ` Byungchul Park
@ 2025-02-20 12:40 ` Byungchul Park
2025-02-20 13:54 ` Matthew Wilcox
` (2 subsequent siblings)
4 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 12:40 UTC (permalink / raw)
To: Hillf Danton, torvalds; +Cc: linux-kernel, linux-mm, kernel_team
On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> > > > To check luf's stability, I ran a heavy LLM inference workload consuming
> > > > 210GiB over 7 days on a machine with 140GiB memory, and decided it's
> > > > stable enough.
> > > >
> > > > I'm posting the latest version so that anyone can try luf mechanism if
> > > > wanted by any chance. However, I tagged RFC again because there are
> > > > still issues that should be resolved to merge to mainline:
> > > >
> > > > 1. Even though system wide total cpu time for TLB shootdown is
> > > > reduced over 95%, page allocation paths should take additional cpu
> > > > time shifted from page reclaim to perform TLB shootdown.
> > > >
> > > > 2. We need luf debug feature to detect when luf goes wrong by any
> > > > chance. I implemented just a draft version that checks the sanity
> > > > on mkwrite(), kmap(), and so on. I need to gather better ideas
> > > > to improve the debug feature.
> > > >
> > > > ---
> > > >
> > > > Hi everyone,
> > > >
> > > > While I'm working with a tiered memory system e.g. CXL memory, I have
> > > > been facing migration overhead esp. tlb shootdown on promotion or
> > > > demotion between different tiers. Yeah.. most tlb shootdowns on
> > > > migration through hinting fault can be avoided thanks to Huang Ying's
> > > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> > > > is inaccessible").
> > > >
> > > > However, it's only for migration through hinting fault. I thought it'd
> > > > be much better if we have a general mechanism to reduce all the tlb
> > > > numbers that we can apply to any unmap code, that we normally believe
> > > > tlb flush should be followed.
> > > >
> > > > I'm suggesting a new mechanism, LUF(Lazy Unmap Flush), that defers tlb
> > > > flush until folios that have been unmapped and freed, eventually get
> > > > allocated again. It's safe for folios that had been mapped read-only
> > > > and were unmapped, as long as the contents of the folios don't change
> > > > while staying in pcp or buddy so we can still read the data through the
> > > > stale tlb entries.
> > > >
> > > Given pcp or buddy, you are opening window for use after free which makes
> > > no sense in 99% cases.
> >
> > Just in case that I don't understand what you meant and for better
> > understanding, can you provide a simple and problematic example from
> > the u-a-f?
> >
> Tell us if it is illegal to commit rape without pregnancy in your home town?
+to Torvalds
Logical blame is welcome but I don't want to see potty-mouthed busters
like him in the Linux community any more. Please, ban him for a better
community.
Byungchul
> PS defering flushing tlb [1,2] is no go.
>
> Subject: Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
> [1] https://lore.kernel.org/lkml/20250127155146.GB25757@willie-the-truck/
> [2] https://lore.kernel.org/lkml/xhsmhwmdwihte.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 11:49 ` Hillf Danton
2025-02-20 12:20 ` Byungchul Park
2025-02-20 12:40 ` Byungchul Park
@ 2025-02-20 13:54 ` Matthew Wilcox
2025-02-20 15:09 ` Steven Rostedt
2025-03-10 23:24 ` Dan Williams
[not found] ` <20250619134922.1219-1-hdanton@sina.com>
4 siblings, 1 reply; 102+ messages in thread
From: Matthew Wilcox @ 2025-02-20 13:54 UTC (permalink / raw)
To: Hillf Danton; +Cc: Byungchul Park, linux-kernel, linux-mm, kernel_team, conduct
On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > Just in case that I don't understand what you meant and for better
> > understanding, can you provide a simple and problematic example from
> > the u-a-f?
> >
> Tell us if it is illegal to commit rape without pregnancy in your home town?
Hillf, this is unacceptable language. You need to apologise.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 13:54 ` Matthew Wilcox
@ 2025-02-20 15:09 ` Steven Rostedt
2025-02-20 22:53 ` Kent Overstreet
2025-02-20 23:25 ` Hillf Danton
0 siblings, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2025-02-20 15:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hillf Danton, Byungchul Park, linux-kernel, linux-mm,
kernel_team, conduct
On Thu, Feb 20, 2025 at 01:54:13PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> > On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > > Just in case that I don't understand what you meant and for better
> > > understanding, can you provide a simple and problematic example from
> > > the u-a-f?
> > >
> > Tell us if it is illegal to commit rape without pregnancy in your home town?
>
> Hillf, this is unacceptable language. You need to apologise.
Agreed. WTF Hillf? Where did that come from? Is this how you talk to your
co-workers?
I'll tell you what would happen in my home town. If someone said
that to a co-worker, they would likely be terminated.
-- Steve
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:09 ` Steven Rostedt
@ 2025-02-20 22:53 ` Kent Overstreet
2025-02-20 23:05 ` Steven Rostedt
2025-02-20 23:25 ` Hillf Danton
1 sibling, 1 reply; 102+ messages in thread
From: Kent Overstreet @ 2025-02-20 22:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Matthew Wilcox, Hillf Danton, Byungchul Park, linux-kernel,
linux-mm, kernel_team, conduct
On Thu, Feb 20, 2025 at 10:09:45AM -0500, Steven Rostedt wrote:
> On Thu, Feb 20, 2025 at 01:54:13PM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > > > Just in case that I don't understand what you meant and for better
> > > > understanding, can you provide a simple and problematic example from
> > > > the u-a-f?
> > > >
> > > Tell us if it is illegal to commit rape without pregnancy in your home town?
> >
> > Hillf, this is unacceptable language. You need to apologise.
>
> Agreed. WTF Hillf? Where did that come from? Is this how you talk to your
> co-workers?
>
> I'll tell you what would happen in my home town. If someone said
> that to a co-worker, they would likely be terminated.
I can't agree with the "this is a firing offence" approach.
We're a community, no one is employed by anyone else here; we work
together because we have to and we have to figure out how to get along.
We work via consensus, not appeals to authority.
However - language like that can _and has_ driven away valuable and
long standing members of our community, so people will feel strongly
about this.
Hillf, I'm going to share a story off list.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 22:53 ` Kent Overstreet
@ 2025-02-20 23:05 ` Steven Rostedt
2025-02-20 23:21 ` Kent Overstreet
0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2025-02-20 23:05 UTC (permalink / raw)
To: Kent Overstreet
Cc: Matthew Wilcox, Hillf Danton, Byungchul Park, linux-kernel,
linux-mm, kernel_team, conduct
On Thu, 20 Feb 2025 17:53:41 -0500
Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > I'll tell you what would happen in my home town. If someone said
> > that to a co-worker, they would likely be terminated.
>
> I can't agree with the "this is a firing offence" approach.
My point was, if this was in a company, it could very well be a firing offense.
>
> We're a community, no one is employed by anyone else here; we work
> together because we have to and we have to figure out how to get along.
> We work via consensus, not appeals to authority.
As a community, yes, things are different. But we should not have to
tolerate such language.
-- Steve
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 23:05 ` Steven Rostedt
@ 2025-02-20 23:21 ` Kent Overstreet
0 siblings, 0 replies; 102+ messages in thread
From: Kent Overstreet @ 2025-02-20 23:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: Matthew Wilcox, Hillf Danton, Byungchul Park, linux-kernel,
linux-mm, kernel_team, conduct
On Thu, Feb 20, 2025 at 06:05:39PM -0500, Steven Rostedt wrote:
> On Thu, 20 Feb 2025 17:53:41 -0500
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> > > I'll tell you what would happen in my home town. If someone said
> > > that to a co-worker, they would likely be terminated.
> >
> > I can't agree with the "this is a firing offence" approach.
>
> My point was, if this was in a company, it could very well be a firing offense.
Well, to a white collar worker, yes. But to someone working in a blue collar
safety critical industry, that's going to come across as rather tame.
(And I do get annoyed when people get overly focused on language and
forget that _we_ are a safety critical industry. To a first
approximation, all the critical infrastructure throughout the world runs
on Linux, stuff that doesn't is a rounding error, and all the testing
and validation that exists only provides a safety factor. We have to
have our shit together, and that does need to come first).
That aside - my point isn't about what should and shouldn't be allowed,
it's just that norms are arbitrary and it's not the best argument if you
want someone to change their behavior.
> > We're a community, no one is employed by anyone else here; we work
> > together because we have to and we have to figure out how to get along.
> > We work via consensus, not appeals to authority.
>
> As a community, yes, things are different. But we should not have to
> tolerate such language.
Agreed.
And I think we're all aware at this point of how that sort of thing does
drive people away, so best not take it so far that people start to consider
you a liability - or one way or another there's going to be an "or else".
This place functions by making people feel respected and valued for the
work they do, so a degree of respect and consideration is required.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:09 ` Steven Rostedt
2025-02-20 22:53 ` Kent Overstreet
@ 2025-02-20 23:25 ` Hillf Danton
2025-02-20 23:44 ` Steven Rostedt
[not found] ` <20250221230556.2479-1-hdanton@sina.com>
1 sibling, 2 replies; 102+ messages in thread
From: Hillf Danton @ 2025-02-20 23:25 UTC (permalink / raw)
To: Steven Rostedt, Matthew Wilcox
Cc: Byungchul Park, linux-kernel, linux-mm, kernel_team, conduct
On Thu, 20 Feb 2025 10:09:45 -0500 Steven Rostedt <rostedt@goodmis.org>
> On Thu, Feb 20, 2025 at 01:54:13PM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 20, 2025 at 07:49:19PM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > > > Just in case that I don't understand what you meant and for better
> > > > understanding, can you provide a simple and problematic example from
> > > > the u-a-f?
> > > >
> > > Tell us if it is illegal to commit rape without pregnancy in your home town?
> >
> > Hillf, this is unacceptable language. You need to apologise.
>
> Agreed. WTF Hillf? Where did that come from? Is this how you talk to your
> co-workers?
>
> I'll tell you what would happen in my home town. If someone said
> that to a co-worker, they would likely be terminated.
>
Interesting, I want to know if the three words, rape, pregnancy and WTK,
could be used before judge in your hometown court by anyone like your lawyer.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 23:25 ` Hillf Danton
@ 2025-02-20 23:44 ` Steven Rostedt
[not found] ` <20250221230556.2479-1-hdanton@sina.com>
1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2025-02-20 23:44 UTC (permalink / raw)
To: Hillf Danton
Cc: Matthew Wilcox, Byungchul Park, linux-kernel, linux-mm,
kernel_team, conduct
On Fri, 21 Feb 2025 07:25:02 +0800
Hillf Danton <hdanton@sina.com> wrote:
> > I'll tell you what would happen in my home town. If someone said
> > that to a co-worker, they would likely be terminated.
> >
> Interesting, I want to know if the three words, rape, pregnancy and WTK,
> could be used before judge in your hometown court by anyone like your lawyer.
Hillf,
This isn't a court. And there's no reason to use the word "rape" in a
technical conversation on the Linux kernel mailing list. Perhaps a person
reading this was a victim of rape. How do you think that would make them
feel? Welcomed to our community? Absolutely not. Which is why it's totally
unacceptable.
-- Steve
^ permalink raw reply [flat|nested] 102+ messages in thread
[parent not found: <20250221230556.2479-1-hdanton@sina.com>]
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
[not found] ` <20250221230556.2479-1-hdanton@sina.com>
@ 2025-02-22 7:16 ` Greg KH
[not found] ` <20250222101100.2531-1-hdanton@sina.com>
1 sibling, 0 replies; 102+ messages in thread
From: Greg KH @ 2025-02-22 7:16 UTC (permalink / raw)
To: Hillf Danton
Cc: Steven Rostedt, Matthew Wilcox, Byungchul Park, linux-kernel,
linux-mm, kernel_team, conduct
On Sat, Feb 22, 2025 at 07:05:26AM +0800, Hillf Danton wrote:
> On Thu, 20 Feb 2025 18:44:12 -0500 Steven Rostedt <rostedt@goodmis.org>
> > On Fri, 21 Feb 2025 07:25:02 +0800 Hillf Danton <hdanton@sina.com> wrote:
> > > > I'll tell you what would happen in my home town. If someone said
> > > > that to a co-worker, they would likely be terminated.
> > > >
> > > Interesting, I want to know if the three words, rape, pregnancy and WTK,
> > > could be used before judge in your hometown court by anyone like your lawyer.
> >
> > This isn't a court. And there's no reason to use the word "rape" in a
> > technical conversation on the Linux kernel mailing list. Perhaps a person
> > reading this was a victim of rape. How do you think that would make them
> > feel? Welcomed to our community? Absolutely not. Which is why it's totally
> > unacceptable.
> >
> There are NAK victims. Did you nak more than twice a week, Steve?
Hillf,
This is not the way to work with your fellow developers in the community
to express disagreements. I would recommend following up with an
apology.
thanks,
greg k-h (On behalf of the Code of Conduct Committee)
^ permalink raw reply [flat|nested] 102+ messages in thread
[parent not found: <20250222101100.2531-1-hdanton@sina.com>]
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
[not found] ` <20250222101100.2531-1-hdanton@sina.com>
@ 2025-02-22 13:57 ` Greg KH
0 siblings, 0 replies; 102+ messages in thread
From: Greg KH @ 2025-02-22 13:57 UTC (permalink / raw)
To: Hillf Danton
Cc: Steven Rostedt, Matthew Wilcox, Byungchul Park, linux-kernel,
linux-mm, conduct
On Sat, Feb 22, 2025 at 06:10:59PM +0800, Hillf Danton wrote:
> On Sat, 22 Feb 2025 08:16:09 +0100 Greg KH <gregkh@linuxfoundation.org>
> > On Sat, Feb 22, 2025 at 07:05:26AM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 18:44:12 -0500 Steven Rostedt <rostedt@goodmis.org>
> > > > On Fri, 21 Feb 2025 07:25:02 +0800 Hillf Danton <hdanton@sina.com> wrote:
> > > > > > I'll tell you what would happen in my home town. If someone said
> > > > > > that to a co-worker, they would likely be terminated.
> > > > > >
> > > > > Interesting, I want to know if the three words, rape, pregnancy and WTK,
> > > > > could be used before judge in your hometown court by anyone like your lawyer.
> > > >
> > > > This isn't a court. And there's no reason to use the word "rape" in a
> > > > technical conversation on the Linux kernel mailing list. Perhaps a person
> > > > reading this was a victim of rape. How do you think that would make them
> > > > feel? Welcomed to our community? Absolutely not. Which is why it's totally
> > > > unacceptable.
> > > >
> > > There are NAK victims. Did you nak more than twice a week, Steve?
> >
> > This is not the way to work with your fellow developers in the community
> > to express disagreements.
> >
> No comment because you are free to express disagreements.
Disagreements are fine, but not with words like what you used, sorry,
that was unacceptable and requires an apology.
> > I would recommend following up with an apology.
> >
> It would take some time for me to opt to follow/ignore what you recommended.
Please take the time to do so very soon.
thanks,
greg k-h (On behalf of the Code of Conduct Committee)
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 11:49 ` Hillf Danton
` (2 preceding siblings ...)
2025-02-20 13:54 ` Matthew Wilcox
@ 2025-03-10 23:24 ` Dan Williams
2025-03-10 23:53 ` Barry Song
[not found] ` <20250619134922.1219-1-hdanton@sina.com>
4 siblings, 1 reply; 102+ messages in thread
From: Dan Williams @ 2025-03-10 23:24 UTC (permalink / raw)
To: Hillf Danton, Byungchul Park; +Cc: linux-kernel, linux-mm, kernel_team, conduct
Hillf Danton wrote:
> On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> > > On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
[..]
Hillf,
The Code of Conduct Committee received reports about your conduct in
this email discussion.
Link to email where the violation took place:
https://lore.kernel.org/lkml/20250220114920.2383-1-hdanton@sina.com/
Our community works on trust and respect and has agreed to abide by the
Code of Conduct:
Reference: https://docs.kernel.org/process/code-of-conduct.html
The Code of Conduct Committee has determined that your written abuse
of another community member required action on your part to repair the
damage to the individual and the community. You took insufficient action
to restore the community's faith in having otherwise productive technical
discussions without the fear of personal attacks.
Following the Code of Conduct Interpretation process the TAB has
approved the following recommendation:
-- Restrict Hillf Danton's participation in the kernel development
process for 3 months.
- Scope: Ban Hillf Danton from Linux kernel mailing lists for a
period of 3 months.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-03-10 23:24 ` Dan Williams
@ 2025-03-10 23:53 ` Barry Song
0 siblings, 0 replies; 102+ messages in thread
From: Barry Song @ 2025-03-10 23:53 UTC (permalink / raw)
To: Dan Williams
Cc: Hillf Danton, Byungchul Park, linux-kernel, linux-mm,
kernel_team, conduct, Nhat Pham
On Tue, Mar 11, 2025 at 12:24 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Hillf Danton wrote:
> > On Thu, 20 Feb 2025 20:09:35 +0900 Byungchul Park wrote:
> > > On Thu, Feb 20, 2025 at 06:32:22PM +0800, Hillf Danton wrote:
> > > > On Thu, 20 Feb 2025 14:20:01 +0900 Byungchul Park <byungchul@sk.com>
> [..]
>
> Hillf,
>
> The Code of Conduct Committee received reports about your conduct in
> this email discussion.
>
> Link to email where the violation took place:
>
> https://lore.kernel.org/lkml/20250220114920.2383-1-hdanton@sina.com/
>
> Our community works on trust and respect and has agreed to abide by the
> Code of Conduct:
>
> Reference: https://docs.kernel.org/process/code-of-conduct.html
>
> The Code of Conduct Committee has determined that your written abuse
> of another community member required action on your part to repair the
> damage to the individual and the community. You took insufficient action
> to restore the community's faith in having otherwise productive technical
> discussions without the fear of personal attacks.
>
> Following the Code of Conduct Interpretation process the TAB has
> approved the following recommendation:
>
> -- Restrict Hillf Danton's participation in the kernel development
> process for 3 months.
>
> - Scope: Ban Hillf Danton from Linux kernel mailing lists for a
> period of 3 months.
Please ban this guy for another 3 months, as I have another case here[1].
This kind of random and insane personal attack is unacceptable.
[1] https://lore.kernel.org/all/20250309010541.3152-1-hdanton@sina.com/#t
>
Thanks
Barry
^ permalink raw reply [flat|nested] 102+ messages in thread
[parent not found: <20250619134922.1219-1-hdanton@sina.com>]
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
[not found] ` <20250619134922.1219-1-hdanton@sina.com>
@ 2025-06-20 17:00 ` Dan Williams
0 siblings, 0 replies; 102+ messages in thread
From: Dan Williams @ 2025-06-20 17:00 UTC (permalink / raw)
To: Hillf Danton, Dan Williams
Cc: Byungchul Park, linux-kernel, linux-mm, conduct
Hillf Danton wrote:
> > Date: Mon, 10 Mar 2025 16:24:09 -0700 Dan Williams wrote:
> >
> > Following the Code of Conduct Interpretation process the TAB has
> > approved the following recommendation:
> >
> > -- Restrict Hillf Danton's participation in the kernel development
> > process for 3 months.
> >
> > - Scope: Ban Hillf Danton from Linux kernel mailing lists for a
> > period of 3 months.
> >
> Dan, the ban expires.
Acknowledged, lifted.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 5:20 [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Byungchul Park
` (26 preceding siblings ...)
2025-02-20 10:32 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Hillf Danton
@ 2025-02-20 15:15 ` Dave Hansen
2025-02-20 15:29 ` Vlastimil Babka
2025-02-20 23:23 ` Byungchul Park
27 siblings, 2 replies; 102+ messages in thread
From: Dave Hansen @ 2025-02-20 15:15 UTC (permalink / raw)
To: Byungchul Park, linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
On 2/19/25 21:20, Byungchul Park wrote:
> I'm posting the latest version so that anyone can try luf mechanism if
> wanted by any chance. However, I tagged RFC again because there are
> still issues that should be resolved to merge to mainline:
I don't see anything fundamentally different here from the last 11
versions. I think the entire approach is dangerous and basically makes
things impossible to debug. It's not clear that some of the failure
scenarios that I've brought up in the past have actually been fixed.
What I've said here still stands:
> https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
> I think tglx would call all of this "tinkering". The approach to this
> series is to "fix" narrow, specific cases that reviewers point out, make
> it compile, then send it out again, hoping someone will apply it.
>
> So, for me, until the approach to this series changes: NAK, for x86.
> Andrew, please don't take this series. Or, if you do, please drop the
> patch enabling it on x86.
I think I'd also like to stop being cc'd on this. If LUF is merged into
mainline and proven to work on arm64 or riscv for a year, I'd be happy
to take another look at enabling it on x86. I think that's just about
the only thing that would make me reconsider.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:15 ` Dave Hansen
@ 2025-02-20 15:29 ` Vlastimil Babka
2025-02-20 23:37 ` Byungchul Park
2025-02-22 1:14 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Shakeel Butt
2025-02-20 23:23 ` Byungchul Park
1 sibling, 2 replies; 102+ messages in thread
From: Vlastimil Babka @ 2025-02-20 15:29 UTC (permalink / raw)
To: Dave Hansen, Byungchul Park, linux-kernel, linux-mm
Cc: kernel_team, akpm, ying.huang, vernhao, mgorman, hughd, willy,
david, peterz, luto, tglx, mingo, bp, dave.hansen, rjgolo
On 2/20/25 16:15, Dave Hansen wrote:
> On 2/19/25 21:20, Byungchul Park wrote:
>> I'm posting the latest version so that anyone can try luf mechanism if
>> wanted by any chance. However, I tagged RFC again because there are
>> still issues that should be resolved to merge to mainline:
>
> I don't see anything fundamentally different here from the last 11
> versions. I think the entire approach is dangerous and basically makes
> things impossible to debug. It's not clear that some of the failure
> scenarios that I've brought up in the past have actually been fixed.
Yes, and it's still an invasive change to the buddy allocator.
IIRC at Plumbers the opinion in the audience was that there might be ways to
improve the batching on unmap to reduce the flushes without such an invasive
and potentially dangerous change? Has that been investigated?
Also "Rebase on akpm/mm.git mm-unstable(5a7056135b) as of Nov 22, 2024." is
very outdated at this point?
Thanks,
Vlastimil
> What I've said here still stands:
>
>> https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
>
>> I think tglx would call all of this "tinkering". The approach to this
>> series is to "fix" narrow, specific cases that reviewers point out, make
>> it compile, then send it out again, hoping someone will apply it.
>>
>> So, for me, until the approach to this series changes: NAK, for x86.
>> Andrew, please don't take this series. Or, if you do, please drop the
>> patch enabling it on x86.
>
> I think I'd also like to stop being cc'd on this. If LUF is merged into
> mainline and proven to work on arm64 or riscv for a year, I'd be happy
> to take another look at enabling it on x86. I think that's just about
> the only thing that would make me reconsider.
>
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:29 ` Vlastimil Babka
@ 2025-02-20 23:37 ` Byungchul Park
2025-02-26 11:30 ` RFC v12 rebased on v6.14-rc4 Byungchul Park
2025-02-26 11:33 ` RFC v12 rebased on mm-unstable as of Feb 21, 2025 Byungchul Park
2025-02-22 1:14 ` [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90% Shakeel Butt
1 sibling, 2 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 23:37 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Dave Hansen, linux-kernel, linux-mm, kernel_team, akpm,
ying.huang, vernhao, mgorman, hughd, willy, david, peterz, luto,
tglx, mingo, bp, dave.hansen, rjgolo
On Thu, Feb 20, 2025 at 04:29:51PM +0100, Vlastimil Babka wrote:
> On 2/20/25 16:15, Dave Hansen wrote:
> > On 2/19/25 21:20, Byungchul Park wrote:
> >> I'm posting the latest version so that anyone can try luf mechanism if
> >> wanted by any chance. However, I tagged RFC again because there are
> >> still issues that should be resolved to merge to mainline:
> >
> > I don't see anything fundamentally different here from the last 11
> > versions. I think the entire approach is dangerous and basically makes
> > things impossible to debug. It's not clear that some of the failure
> > scenarios that I've brought up in the past have actually been fixed.
>
> Yes, and it's still an invasive change to the buddy allocator.
Didn't want.. but admit.
> IIRC at Plumbers the opinion in the audience was that there might be ways to
> improve the batching on unmap to reduce the flushes without such an invasive
> and potentially dangerous change? Has that been investigated?
Sure. I tried an approach that holds those pages unfreed until either
no one accesses the pages of interest or memory pressure gets high.
However, unfortunately it was very hard to fix the performance
degradation caused by the increased page reclaim due to the unfreed pages.
> Also "Rebase on akpm/mm.git mm-unstable(5a7056135b) as of Nov 22, 2024." is
> very outdated at this point?
Sorry for that. I will rebase and share.
Byungchul
>
> Thanks,
> Vlastimil
>
> > What I've said here still stands:
> >
> >> https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
> >
> >> I think tglx would call all of this "tinkering". The approach to this
> >> series is to "fix" narrow, specific cases that reviewers point out, make
> >> it compile, then send it out again, hoping someone will apply it.
> >>
> >> So, for me, until the approach to this series changes: NAK, for x86.
> >> Andrew, please don't take this series. Or, if you do, please drop the
> >> patch enabling it on x86.
> >
> > I think I'd also like to stop being cc'd on this. If LUF is merged into
> > mainline and proven to work on arm64 or riscv for a year, I'd be happy
> > to take another look at enabling it on x86. I think that's just about
> > the only thing that would make me reconsider.
> >
^ permalink raw reply [flat|nested] 102+ messages in thread
* RFC v12 rebased on v6.14-rc4
2025-02-20 23:37 ` Byungchul Park
@ 2025-02-26 11:30 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
2025-02-26 11:33 ` RFC v12 rebased on mm-unstable as of Feb 21, 2025 Byungchul Park
1 sibling, 1 reply; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 11:30 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Dave Hansen, linux-kernel, linux-mm, kernel_team, akpm,
ying.huang, vernhao, mgorman, hughd, willy, david, peterz, luto,
tglx, mingo, bp, rjgolo
On Fri, Feb 21, 2025 at 08:37:10AM +0900, Byungchul Park wrote:
> On Thu, Feb 20, 2025 at 04:29:51PM +0100, Vlastimil Babka wrote:
> > On 2/20/25 16:15, Dave Hansen wrote:
> > > On 2/19/25 21:20, Byungchul Park wrote:
> > >> I'm posting the latest version so that anyone can try luf mechanism if
> > >> wanted by any chance. However, I tagged RFC again because there are
> > >> still issues that should be resolved to merge to mainline:
> > >
> > > I don't see anything fundamentally different here from the last 11
> > > versions. I think the entire approach is dangerous and basically makes
> > > things impossible to debug. It's not clear that some of the failure
> > > scenarios that I've brought up in the past have actually been fixed.
> >
> > Yes, and it's still an invasive change to the buddy allocator.
>
> Didn't want.. but admit.
>
> > IIRC at Plumbers the opinion in the audience was that there might be ways to
> > improve the batching on unmap to reduce the flushes without such an invasive
> > and potentially dangerous change? Has that been investigated?
>
> Sure. I tried an approach that holds those pages unfreed until either
> no one accesses the pages of interest or memory pressure gets high.
> However, unfortunately it was very hard to fix the performance
> degradation caused by the increased page reclaim due to the unfreed pages.
>
> > Also "Rebase on akpm/mm.git mm-unstable(5a7056135b) as of Nov 22, 2024." is
> > very outdated at this point?
>
> Sorry for that. I will rebase and share.
This is the same patch set but rebased on v6.14-rc4.
Byungchul
>
> Byungchul
> >
> > Thanks,
> > Vlastimil
> >
> > > What I've said here still stands:
> > >
> > >> https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
> > >
> > >> I think tglx would call all of this "tinkering". The approach to this
> > >> series is to "fix" narrow, specific cases that reviewers point out, make
> > >> it compile, then send it out again, hoping someone will apply it.
> > >>
> > >> So, for me, until the approach to this series changes: NAK, for x86.
> > >> Andrew, please don't take this series. Or, if you do, please drop the
> > >> patch enabling it on x86.
> > >
> > > I think I'd also like to stop being cc'd on this. If LUF is merged into
> > > mainline and proven to work on arm64 or riscv for a year, I'd be happy
> > > to take another look at enabling it on x86. I think that's just about
> > > the only thing that would make me reconsider.
> > >
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data
2025-02-26 11:30 ` RFC v12 rebased on v6.14-rc4 Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 02/25] arm64/tlbflush: " Byungchul Park
` (23 more replies)
0 siblings, 24 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios wouldn't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that needs to recognize
read-only tlb entries by separating the tlb batch arch data into two,
one for read-only entries and the other for writable ones, and merging
the two when needed.
It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed since then. To support this, add APIs
manipulating the arch data for x86.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/x86/include/asm/tlbflush.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e0..c27e61bd274a5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -5,6 +5,7 @@
#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
@@ -294,6 +295,29 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
base-commit: d082ecbc71e9e0bf49883ee4afd435a77a5101b6
--
2.17.1
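For illustration, a minimal sketch of how the four helpers above are
meant to compose; the names luf_fold_mm(), luf_shootdown_needed() and
luf_reset() are hypothetical, used only to show the intended call
pattern, and are not code from this series.
static inline void luf_fold_mm(struct arch_tlbflush_unmap_batch *pending,
			       struct arch_tlbflush_unmap_batch *unmap,
			       struct mm_struct *mm)
{
	/* Skip the fold if 'pending' already covers every CPU of 'mm'. */
	if (!arch_tlbbatch_need_fold(pending, mm))
		return;
	/* Otherwise accumulate the CPUs recorded in 'unmap'. */
	arch_tlbbatch_fold(pending, unmap);
}
static inline bool luf_shootdown_needed(struct arch_tlbflush_unmap_batch *pending,
					struct arch_tlbflush_unmap_batch *flushed)
{
	/*
	 * arch_tlbbatch_done() drops from 'pending' every CPU that
	 * 'flushed' has already covered and reports whether the
	 * remainder is empty, in which case no shootdown is left.
	 */
	return !arch_tlbbatch_done(pending, flushed);
}
static inline void luf_reset(struct arch_tlbflush_unmap_batch *pending)
{
	/* Forget everything once the shootdown has been issued. */
	arch_tlbbatch_clear(pending);
}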
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 02/25] arm64/tlbflush: add APIs manipulating tlb batch's arch data
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 03/25] riscv/tlb: " Byungchul Park
` (22 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that requires manipulating the
tlb batch's arch data. Even though arm64 does nothing for these tlb
operations, an arch with CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH should
still provide the APIs.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc94e036a26b9..acac53a21e5d1 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -354,6 +354,33 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ /* nothing to do */
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* nothing to do */
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return false;
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* The kernel can consider the tlb batch to always have been done. */
+ return true;
+}
+
/*
* This is meant to avoid soft lock-ups on large TLB flushing ranges and not
* necessarily a performance improvement.
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 03/25] riscv/tlb: add APIs manipulating tlb batch's arch data
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 02/25] arm64/tlbflush: " Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush() Byungchul Park
` (21 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that needs to recognize
read-only tlb entries by separating the tlb batch arch data into two,
one for read-only entries and the other for writable ones, and merging
the two when needed.
It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed since then. To support this, add APIs
manipulating the arch data for riscv.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 72e5599349529..1dc7d30273d59 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -8,6 +8,7 @@
#define _ASM_RISCV_TLBFLUSH_H
#include <linux/mm_types.h>
+#include <linux/cpumask.h>
#include <asm/smp.h>
#include <asm/errata_list.h>
@@ -65,6 +66,33 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
extern unsigned long tlb_flush_all_threshold;
#else /* CONFIG_MMU */
#define local_flush_tlb_all() do { } while (0)
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush()
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 02/25] arm64/tlbflush: " Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 03/25] riscv/tlb: " Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 05/25] mm/buddy: make room for a new variable, luf_key, in struct page Byungchul Park
` (20 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that needs to avoid redundant
tlb flushes by manipulating the tlb batch's arch data. To achieve that,
we need to separate the part that clears the tlb batch's arch data out
of arch_tlbbatch_flush().
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/mm/tlbflush.c | 1 -
arch/x86/mm/tlb.c | 2 --
mm/rmap.c | 1 +
3 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9b6e86ce38674..36f996af6256c 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -201,5 +201,4 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
- cpumask_clear(&batch->cpumask);
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bbe..523e8bb6fba1f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1292,8 +1292,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
- cpumask_clear(&batch->cpumask);
-
put_flush_tlb_info();
put_cpu();
}
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7e..2de01de164ef0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -648,6 +648,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
+ arch_tlbbatch_clear(&tlb_ubc->arch);
tlb_ubc->flush_required = false;
tlb_ubc->writable = false;
}
--
2.17.1
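As a rough illustration of what this split enables, the sketch below
flushes a batch but keeps its cpumask alive long enough to remember
which CPUs were just covered; remember_flushed_cpus() and the 'covered'
accumulator are hypothetical names, not code from this series.
static void remember_flushed_cpus(struct arch_tlbflush_unmap_batch *batch,
				  struct arch_tlbflush_unmap_batch *covered)
{
	/* Issue the shootdown; the cpumask now survives this call. */
	arch_tlbbatch_flush(batch);
	/* Record which CPUs have just been flushed ... */
	arch_tlbbatch_fold(covered, batch);
	/* ... and only then reset the batch for reuse. */
	arch_tlbbatch_clear(batch);
}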
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 05/25] mm/buddy: make room for a new variable, luf_key, in struct page
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (2 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush() Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 06/25] mm: move should_skip_kasan_poison() to mm/internal.h Byungchul Park
` (19 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that tracks the need of a tlb flush for each page residing in buddy.
Since the private field in struct page is used in buddy only to store
the page order, ranging from 0 to MAX_PAGE_ORDER, it can be covered by
an unsigned short. So split it into two smaller fields, order and
luf_key, so that both can be used in buddy at the same time.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 42 +++++++++++++++++++++++++++++++++-------
mm/internal.h | 4 ++--
mm/page_alloc.c | 2 +-
3 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6b..7d78a285e52ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -106,13 +106,27 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
- /**
- * @private: Mapping-private opaque data.
- * Usually used for buffer_heads if PagePrivate.
- * Used for swp_entry_t if swapcache flag set.
- * Indicates order in the buddy system if PageBuddy.
- */
- unsigned long private;
+ union {
+ /**
+ * @private: Mapping-private opaque data.
+ * Usually used for buffer_heads if PagePrivate.
+ * Used for swp_entry_t if swapcache flag set.
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned long private;
+ struct {
+ /*
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned short order;
+
+ /*
+ * For tracking need of tlb flush,
+ * by luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+ };
+ };
};
struct { /* page_pool used by netstack */
/**
@@ -566,6 +580,20 @@ static inline void set_page_private(struct page *page, unsigned long private)
page->private = private;
}
+#define page_buddy_order(page) ((page)->order)
+
+static inline void set_page_buddy_order(struct page *page, unsigned int order)
+{
+ page->order = (unsigned short)order;
+}
+
+#define page_luf_key(page) ((page)->luf_key)
+
+static inline void set_page_luf_key(struct page *page, unsigned short luf_key)
+{
+ page->luf_key = luf_key;
+}
+
static inline void *folio_get_private(struct folio *folio)
{
return folio->private;
diff --git a/mm/internal.h b/mm/internal.h
index 109ef30fee11f..d7161a6e0b352 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -543,7 +543,7 @@ struct alloc_context {
static inline unsigned int buddy_order(struct page *page)
{
/* PageBuddy() must be checked by the caller */
- return page_private(page);
+ return page_buddy_order(page);
}
/*
@@ -557,7 +557,7 @@ static inline unsigned int buddy_order(struct page *page)
* times, potentially observing different values in the tests and the actual
* use of the result.
*/
-#define buddy_order_unsafe(page) READ_ONCE(page_private(page))
+#define buddy_order_unsafe(page) READ_ONCE(page_buddy_order(page))
/*
* This function checks whether a page is free && is the buddy
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 579789600a3c7..c08b1389d5671 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -576,7 +576,7 @@ void prep_compound_page(struct page *page, unsigned int order)
static inline void set_buddy_order(struct page *page, unsigned int order)
{
- set_page_private(page, order);
+ set_page_buddy_order(page, order);
__SetPageBuddy(page);
}
--
2.17.1
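A small sketch of how the two sub-fields are expected to coexist for a
page sitting in buddy; luf_tag_free_page() and luf_key_of_page() are
hypothetical helpers, while set_page_buddy_order(), set_page_luf_key()
and page_luf_key() come from the patch above.
static void luf_tag_free_page(struct page *page, unsigned int order,
			      unsigned short luf_key)
{
	/* The whole range of buddy orders must fit in 16 bits. */
	BUILD_BUG_ON(MAX_PAGE_ORDER > USHRT_MAX);
	/* The order keeps being stored in the page as before ... */
	set_page_buddy_order(page, order);
	/* ... while the new sub-field records which luf batch covers it. */
	set_page_luf_key(page, luf_key);
}
static unsigned short luf_key_of_page(struct page *page)
{
	/* At allocation time, the key tells which tlb flush is still owed. */
	return page_luf_key(page);
}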
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 06/25] mm: move should_skip_kasan_poison() to mm/internal.h
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (3 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 05/25] mm/buddy: make room for a new variable, luf_key, in struct page Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 07/25] mm: introduce luf_ugen to be used as a global timestamp Byungchul Park
` (18 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to use the should_skip_kasan_poison() function from mm/internal.h.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 47 -----------------------------------------------
2 files changed, 47 insertions(+), 47 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index d7161a6e0b352..4c8ed93a792ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1051,8 +1051,55 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
DECLARE_STATIC_KEY_TRUE(deferred_pages);
bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
+
+static inline bool deferred_pages_enabled(void)
+{
+ return static_branch_unlikely(&deferred_pages);
+}
+#else
+static inline bool deferred_pages_enabled(void)
+{
+ return false;
+}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+/*
+ * Skip KASAN memory poisoning when either:
+ *
+ * 1. For generic KASAN: deferred memory initialization has not yet completed.
+ * Tag-based KASAN modes skip pages freed via deferred memory initialization
+ * using page tags instead (see below).
+ * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
+ * that error detection is disabled for accesses via the page address.
+ *
+ * Pages will have match-all tags in the following circumstances:
+ *
+ * 1. Pages are being initialized for the first time, including during deferred
+ * memory init; see the call to page_kasan_tag_reset in __init_single_page.
+ * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
+ * exception of pages unpoisoned by kasan_unpoison_vmalloc.
+ * 3. The allocation was excluded from being checked due to sampling,
+ * see the call to kasan_unpoison_pages.
+ *
+ * Poisoning pages during deferred memory init will greatly lengthen the
+ * process and cause problem in large memory systems as the deferred pages
+ * initialization is done with interrupt disabled.
+ *
+ * Assuming that there will be no reference to those newly initialized
+ * pages before they are ever allocated, this should have no effect on
+ * KASAN memory tracking as the poison will be properly inserted at page
+ * allocation time. The only corner case is when pages are allocated by
+ * on-demand allocation and then freed again before the deferred pages
+ * initialization is done, but this is not likely to happen.
+ */
+static inline bool should_skip_kasan_poison(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_KASAN_GENERIC))
+ return deferred_pages_enabled();
+
+ return page_kasan_tag(page) == KASAN_TAG_KERNEL;
+}
+
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c08b1389d5671..27aeee0cfcf8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -299,11 +299,6 @@ int page_group_by_mobility_disabled __read_mostly;
*/
DEFINE_STATIC_KEY_TRUE(deferred_pages);
-static inline bool deferred_pages_enabled(void)
-{
- return static_branch_unlikely(&deferred_pages);
-}
-
/*
* deferred_grow_zone() is __init, but it is called from
* get_page_from_freelist() during early boot until deferred_pages permanently
@@ -316,11 +311,6 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
return deferred_grow_zone(zone, order);
}
#else
-static inline bool deferred_pages_enabled(void)
-{
- return false;
-}
-
static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order)
{
return false;
@@ -993,43 +983,6 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
return ret;
}
-/*
- * Skip KASAN memory poisoning when either:
- *
- * 1. For generic KASAN: deferred memory initialization has not yet completed.
- * Tag-based KASAN modes skip pages freed via deferred memory initialization
- * using page tags instead (see below).
- * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
- * that error detection is disabled for accesses via the page address.
- *
- * Pages will have match-all tags in the following circumstances:
- *
- * 1. Pages are being initialized for the first time, including during deferred
- * memory init; see the call to page_kasan_tag_reset in __init_single_page.
- * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
- * exception of pages unpoisoned by kasan_unpoison_vmalloc.
- * 3. The allocation was excluded from being checked due to sampling,
- * see the call to kasan_unpoison_pages.
- *
- * Poisoning pages during deferred memory init will greatly lengthen the
- * process and cause problem in large memory systems as the deferred pages
- * initialization is done with interrupt disabled.
- *
- * Assuming that there will be no reference to those newly initialized
- * pages before they are ever allocated, this should have no effect on
- * KASAN memory tracking as the poison will be properly inserted at page
- * allocation time. The only corner case is when pages are allocated by
- * on-demand allocation and then freed again before the deferred pages
- * initialization is done, but this is not likely to happen.
- */
-static inline bool should_skip_kasan_poison(struct page *page)
-{
- if (IS_ENABLED(CONFIG_KASAN_GENERIC))
- return deferred_pages_enabled();
-
- return page_kasan_tag(page) == KASAN_TAG_KERNEL;
-}
-
static void kernel_init_pages(struct page *page, int numpages)
{
int i;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 07/25] mm: introduce luf_ugen to be used as a global timestamp
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (4 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 06/25] mm: move should_skip_kasan_poison() to mm/internal.h Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data Byungchul Park
` (17 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to evaluate the temporal sequence of events to determine
whether the required tlb flush has been done on each CPU.
To achieve that, this patch introduces a generation number, luf_ugen,
and a few APIs manipulating the number. It's worth noting the number is
designed to wrap around, so care must be taken when using it.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
mm/rmap.c | 22 ++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb70..8c3481402d8cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4155,4 +4155,38 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+/*
+ * luf_ugen will start with 2 so that 1 can be regarded as a passed one.
+ */
+#define LUF_UGEN_INIT 2
+
+static inline bool ugen_before(unsigned long a, unsigned long b)
+{
+ /*
+ * Consider wraparound.
+ */
+ return (long)(a - b) < 0;
+}
+
+static inline unsigned long next_ugen(unsigned long ugen)
+{
+ if (ugen + 1)
+ return ugen + 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen + 2;
+}
+
+static inline unsigned long prev_ugen(unsigned long ugen)
+{
+ if (ugen - 1)
+ return ugen - 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen - 2;
+}
+#endif
#endif /* _LINUX_MM_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 2de01de164ef0..ed345503e4f88 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -634,6 +634,28 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
}
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+/*
+ * This generation number is primarily used as a global timestamp to
+ * determine whether the required tlb flush has been done on each CPU. The
+ * function, ugen_before(), should be used to evaluate the temporal
+ * sequence of events because the number is designed to wraparound.
+ */
+static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+
+/*
+ * Don't return invalid luf_ugen, zero.
+ */
+static unsigned long __maybe_unused new_luf_ugen(void)
+{
+ unsigned long ugen = atomic_long_inc_return(&luf_ugen);
+
+ if (!ugen)
+ ugen = atomic_long_inc_return(&luf_ugen);
+
+ return ugen;
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1
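To make the wraparound rule concrete, a tiny illustrative check using
the helpers added to include/linux/mm.h above; luf_ugen_selftest() is a
hypothetical function and the values are arbitrary examples.
static void luf_ugen_selftest(void)
{
	unsigned long a = ULONG_MAX - 1;	/* an old timestamp near the wrap point */
	unsigned long b = next_ugen(a);		/* ULONG_MAX */
	unsigned long c = next_ugen(b);		/* wraps around, skipping the invalid 0, to 1 */

	/* A plain '<' would claim c is older than a; ugen_before() keeps the real order. */
	WARN_ON(!ugen_before(a, b));
	WARN_ON(!ugen_before(b, c));
	WARN_ON(ugen_before(c, a));
}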
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (5 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 07/25] mm: introduce luf_ugen to be used as a global timestamp Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator Byungchul Park
` (16 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to keep luf meta data per page while the page stays in pcp
or the buddy allocator. The meta data includes the cpumask for tlb
shootdown and luf's request generation number.
Since struct page doesn't have enough room to store the luf meta data,
this patch introduces a hash table to store it and makes each page keep
its hash key instead.
Since all the pages in pcp or buddy share the hash table, collisions
are inevitable, so care must be taken when reading or updating an entry.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 10 ++++
mm/internal.h | 8 +++
mm/rmap.c | 122 +++++++++++++++++++++++++++++++++++++--
3 files changed, 136 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7d78a285e52ca..4bfe8d072b0ea 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -32,6 +32,16 @@
struct address_space;
struct mem_cgroup;
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+#else
+struct luf_batch {};
+#endif
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/mm/internal.h b/mm/internal.h
index 4c8ed93a792ec..3333d8d461c2c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1253,6 +1253,8 @@ extern struct workqueue_struct *mm_percpu_wq;
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
static inline void try_to_unmap_flush(void)
{
@@ -1263,6 +1265,12 @@ static inline void try_to_unmap_flush_dirty(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
+{
+}
+static inline void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index ed345503e4f88..74fbf6c2fb3a7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -641,7 +641,7 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
* function, ugen_before(), should be used to evaluate the temporal
* sequence of events because the number is designed to wraparound.
*/
-static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
@@ -656,6 +656,122 @@ static unsigned long __maybe_unused new_luf_ugen(void)
return ugen;
}
+static void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+ arch_tlbbatch_clear(&batch->arch);
+ batch->flush_required = false;
+ batch->writable = false;
+}
+
+void fold_batch(struct tlbflush_unmap_batch *dst,
+ struct tlbflush_unmap_batch *src, bool reset)
+{
+ if (!src->flush_required)
+ return;
+
+ /*
+ * Fold src to dst.
+ */
+ arch_tlbbatch_fold(&dst->arch, &src->arch);
+ dst->writable = dst->writable || src->writable;
+ dst->flush_required = true;
+
+ if (!reset)
+ return;
+
+ /*
+ * Reset src.
+ */
+ reset_batch(src);
+}
+
+/*
+ * The range that luf_key covers, which is 'unsigned short' type.
+ */
+#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
+
+/*
+ * Use 0th entry as accumulated batch.
+ */
+static struct luf_batch luf_batch[NR_LUF_BATCH];
+
+static void luf_batch_init(struct luf_batch *lb)
+{
+ rwlock_init(&lb->lock);
+ reset_batch(&lb->batch);
+ lb->ugen = atomic_long_read(&luf_ugen) - 1;
+}
+
+static int __init luf_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_LUF_BATCH; i++)
+ luf_batch_init(&luf_batch[i]);
+
+ return 0;
+}
+early_initcall(luf_init);
+
+/*
+ * key to point an entry of the luf_batch array
+ *
+ * note: zero means invalid key
+ */
+static atomic_t luf_kgen = ATOMIC_INIT(1);
+
+/*
+ * Don't return invalid luf_key, zero.
+ */
+static unsigned short __maybe_unused new_luf_key(void)
+{
+ unsigned short luf_key = atomic_inc_return(&luf_kgen);
+
+ if (!luf_key)
+ luf_key = atomic_inc_return(&luf_kgen);
+
+ return luf_key;
+}
+
+static void __fold_luf_batch(struct luf_batch *dst_lb,
+ struct tlbflush_unmap_batch *src_batch,
+ unsigned long src_ugen)
+{
+ /*
+ * dst_lb->ugen represents one that requires tlb shootdown for
+ * it, that is, a sort of request number. The newer it is, the
+ * more tlb shootdown might be needed to fulfill the newer
+ * request. Conservatively keep the newer one.
+ */
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ dst_lb->ugen = src_ugen;
+ fold_batch(&dst_lb->batch, src_batch, false);
+}
+
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+ unsigned long flags;
+
+ /*
+ * Exactly same. Nothing to fold.
+ */
+ if (dst == src)
+ return;
+
+ if (&src->lock < &dst->lock) {
+ read_lock_irqsave(&src->lock, flags);
+ write_lock(&dst->lock);
+ } else {
+ write_lock_irqsave(&dst->lock, flags);
+ read_lock(&src->lock);
+ }
+
+ __fold_luf_batch(dst, &src->batch, src->ugen);
+
+ write_unlock(&dst->lock);
+ read_unlock_irqrestore(&src->lock, flags);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -670,9 +786,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
- arch_tlbbatch_clear(&tlb_ubc->arch);
- tlb_ubc->flush_required = false;
- tlb_ubc->writable = false;
+ reset_batch(tlb_ubc);
}
/* Flush iff there are potentially writable TLB entries that can race with IO */
--
2.17.1
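For illustration, a sketch of how an unmap event could be recorded into
the hash table and keyed from a page; record_luf_unmap() is a
hypothetical wrapper and the exact call site is an assumption here,
while luf_batch[], new_luf_key(), new_luf_ugen() and __fold_luf_batch()
are from the patch above.
static unsigned short record_luf_unmap(struct tlbflush_unmap_batch *tlb_ubc)
{
	unsigned short key = new_luf_key();
	struct luf_batch *lb = &luf_batch[key];
	unsigned long flags;

	/*
	 * The 16-bit key space is shared by every page in pcp/buddy,
	 * so the entry may already carry another event's data.  Fold
	 * under the entry's lock instead of overwriting it.
	 */
	write_lock_irqsave(&lb->lock, flags);
	__fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
	write_unlock_irqrestore(&lb->lock, flags);

	/* The key is what a page would keep via set_page_luf_key(). */
	return key;
}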
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on v6.14-rc4 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (6 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable Byungchul Park
` (15 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism that
performs the tlb shootdown required on exit from the page allocator.
This patch introduces a new API rather than reusing the existing
try_to_unmap_flush(), to avoid repeated and redundant tlb shootdowns due
to frequent page allocations during a session of batched unmap flush.
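For illustration only, and not part of this patch, the sketch below shows
the intended call pattern; alloc_one_page_example() and the elided freelist
handling are hypothetical placeholders, while try_to_unmap_flush_takeoff()
is the API added here and the folding into current->tlb_ubc_takeoff is done
by later patches in the series.

static struct page *alloc_one_page_example(struct zone *zone)
{
	struct page *page = NULL;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	/*
	 * Take a page off the freelist here (elided); later patches
	 * fold the page's pending luf shootdown into
	 * current->tlb_ubc_takeoff at this point.
	 */
	spin_unlock_irqrestore(&zone->lock, flags);

	/*
	 * Flush only what the taken-off pages require, instead of the
	 * whole tlb_ubc accumulated during a batched unmap session.
	 */
	try_to_unmap_flush_takeoff();

	return page;
}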
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/internal.h | 4 ++++
mm/rmap.c | 20 ++++++++++++++++++++
3 files changed, 25 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d6..86ef426644639 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,6 +1401,7 @@ struct task_struct {
#endif
struct tlbflush_unmap_batch tlb_ubc;
+ struct tlbflush_unmap_batch tlb_ubc_takeoff;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index 3333d8d461c2c..b52e14f86c436 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1252,6 +1252,7 @@ extern struct workqueue_struct *mm_percpu_wq;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
+void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
@@ -1262,6 +1263,9 @@ static inline void try_to_unmap_flush(void)
static inline void try_to_unmap_flush_dirty(void)
{
}
+static inline void try_to_unmap_flush_takeoff(void)
+{
+}
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 74fbf6c2fb3a7..72c5e665e59a4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -772,6 +772,26 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+void try_to_unmap_flush_takeoff(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+
+ if (!tlb_ubc_takeoff->flush_required)
+ return;
+
+ arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+
+ /*
+ * Now that tlb shootdown of tlb_ubc_takeoff has been performed,
+ * it's a good chance to shrink tlb_ubc if possible.
+ */
+ if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc);
+
+ reset_batch(tlb_ubc_takeoff);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (7 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping Byungchul Park
` (14 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism that
should identify whether tlb shootdown can be performed on page allocation.
In a context with irqs disabled, or outside task context, tlb shootdown
cannot be performed because of a deadlock issue. Thus, the page allocator
should work while being aware of whether tlb shootdown can be performed on
the page being returned.
This patch introduces APIs that the pcp or buddy page allocator can use to
delimit the critical sections taking pages off and to identify whether tlb
shootdown can be performed.
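For illustration only, and not part of this patch, a minimal sketch of the
delimiting pattern that later patches apply around the paths taking pages
off; take_page_off_example() is a hypothetical placeholder, and the list
removal itself is elided.

static struct page *take_page_off_example(struct zone *zone,
					  struct free_area *area, int mt)
{
	struct page *page;
	unsigned long flags;

	/* Open the take-off critical section. */
	luf_takeoff_start();

	spin_lock_irqsave(&zone->lock, flags);
	page = list_first_entry_or_null(&area->free_list[mt],
					struct page, buddy_list);
	/*
	 * Refuse the page if its pending shootdown cannot be folded in
	 * this context, e.g. because irqs were disabled at
	 * luf_takeoff_start().
	 */
	if (page && !luf_takeoff_check_and_fold(page))
		page = NULL;
	spin_unlock_irqrestore(&zone->lock, flags);

	/* Perform the folded tlb shootdown, if any, on the way out. */
	luf_takeoff_end();

	return page;
}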
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 5 ++
mm/internal.h | 14 ++++
mm/page_alloc.c | 159 ++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 2 +-
4 files changed, 179 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 86ef426644639..a3049ea5b3ad3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,11 @@ struct task_struct {
struct callback_head cid_work;
#endif
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ int luf_no_shootdown;
+ int luf_takeoff_started;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
diff --git a/mm/internal.h b/mm/internal.h
index b52e14f86c436..5e67f009d23c6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1580,6 +1580,20 @@ static inline void accept_page(struct page *page)
{
}
#endif /* CONFIG_UNACCEPTED_MEMORY */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+extern struct luf_batch luf_batch[];
+bool luf_takeoff_start(void);
+void luf_takeoff_end(void);
+bool luf_takeoff_no_shootdown(void);
+bool luf_takeoff_check(struct page *page);
+bool luf_takeoff_check_and_fold(struct page *page);
+#else
+static inline bool luf_takeoff_start(void) { return false; }
+static inline void luf_takeoff_end(void) {}
+static inline bool luf_takeoff_no_shootdown(void) { return true; }
+static inline bool luf_takeoff_check(struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+#endif
/* pagewalk.c */
int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 27aeee0cfcf8f..a964a98fbad51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,6 +622,165 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+static bool no_shootdown_context(void)
+{
+ /*
+ * If it's performed with irqs disabled, that might cause a deadlock.
+ * Avoid tlb shootdown in this case.
+ */
+ return !(!irqs_disabled() && in_task());
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_start(void)
+{
+ unsigned long flags;
+ bool no_shootdown = no_shootdown_context();
+
+ local_irq_save(flags);
+
+ /*
+ * It's the outermost luf_takeoff_start().
+ */
+ if (!current->luf_takeoff_started)
+ VM_WARN_ON(current->luf_no_shootdown);
+
+ /*
+ * current->luf_no_shootdown > 0 doesn't mean tlb shootdown is
+ * not allowed at all. However, it guarantees tlb shootdown is
+ * possible once current->luf_no_shootdown == 0. It might look
+ * too conservative but do it this way for now for simplicity.
+ */
+ if (no_shootdown || current->luf_no_shootdown)
+ current->luf_no_shootdown++;
+
+ current->luf_takeoff_started++;
+ local_irq_restore(flags);
+
+ return !no_shootdown;
+}
+
+/*
+ * Should be called within the same context as luf_takeoff_start().
+ */
+void luf_takeoff_end(void)
+{
+ unsigned long flags;
+ bool no_shootdown;
+ bool outmost = false;
+
+ local_irq_save(flags);
+ VM_WARN_ON(!current->luf_takeoff_started);
+
+ /*
+ * Assume the context and irq flags are the same as those at
+ * luf_takeoff_start().
+ */
+ if (current->luf_no_shootdown)
+ current->luf_no_shootdown--;
+
+ no_shootdown = !!current->luf_no_shootdown;
+
+ current->luf_takeoff_started--;
+
+ /*
+ * It's the outermost luf_takeoff_end().
+ */
+ if (!current->luf_takeoff_started)
+ outmost = true;
+
+ local_irq_restore(flags);
+
+ if (no_shootdown)
+ goto out;
+
+ try_to_unmap_flush_takeoff();
+out:
+ if (outmost)
+ VM_WARN_ON(current->luf_no_shootdown);
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_no_shootdown(void)
+{
+ bool no_shootdown = true;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ goto out;
+ }
+ no_shootdown = current->luf_no_shootdown;
+out:
+ local_irq_restore(flags);
+ return no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check(struct page *page)
+{
+ unsigned short luf_key = page_luf_key(page);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ return !current->luf_no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check_and_fold(struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+}
+#endif
+
static inline void account_freepages(struct zone *zone, int nr_pages,
int migratetype)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index 72c5e665e59a4..1581b1a00f974 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -693,7 +693,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
/*
* Use 0th entry as accumulated batch.
*/
-static struct luf_batch luf_batch[NR_LUF_BATCH];
+struct luf_batch luf_batch[NR_LUF_BATCH];
static void luf_batch_init(struct luf_batch *lb)
{
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (8 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 12/25] mm: delimit critical sections to take off pages from pcp or buddy alloctor Byungchul Park
` (13 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism that
needs to pass luf_key to the pcp or buddy allocator when freeing pages
after unmapping, e.g. during page reclaim or page migration.
The luf_key will be used to track, per page residing in pcp or buddy, the
need for tlb shootdown and which cpus need to perform the tlb flush, and it
should be handed over properly when pages travel between pcp and buddy.
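For illustration only, and not part of this patch, a minimal sketch of the
free-side hand-over; reclaim_free_example() is a hypothetical placeholder,
while free_frozen_pages() and free_unref_folios() gain the luf_key
parameter in this patch.

static void reclaim_free_example(struct page *page,
				 struct folio_batch *fbatch,
				 unsigned short luf_key)
{
	/*
	 * luf_key == 0 means "no pending shootdown"; a non-zero key
	 * indexes the luf_batch[] hash and is kept in page->private
	 * via set_page_luf_key() while the page sits in pcp or buddy.
	 */
	free_frozen_pages(page, 0, luf_key);

	/* The batched variant takes the same key for a whole batch. */
	free_unref_folios(fbatch, luf_key);
}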
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 4 +-
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++-----------
mm/page_frag_cache.c | 6 +--
mm/page_isolation.c | 6 +++
mm/page_reporting.c | 6 +++
mm/slub.c | 2 +-
mm/swap.c | 4 +-
mm/vmscan.c | 8 +--
8 files changed, 111 insertions(+), 41 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5e67f009d23c6..47d3291278e81 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -746,8 +746,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
nodemask_t *);
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
-void free_frozen_pages(struct page *page, unsigned int order);
-void free_unref_folios(struct folio_batch *fbatch);
+void free_frozen_pages(struct page *page, unsigned int order, unsigned short luf_key);
+void free_unref_folios(struct folio_batch *fbatch, unsigned short luf_key);
#ifdef CONFIG_NUMA
struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a964a98fbad51..d2d23bbd60467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,7 +212,7 @@ unsigned int pageblock_order __read_mostly;
#endif
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags);
+ fpi_t fpi_flags, unsigned short luf_key);
/*
* results with 256, 32 in the lowmem_reserve sysctl:
@@ -850,8 +850,13 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
list_del(&page->buddy_list);
__ClearPageBuddy(page);
- set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ /*
+ * Keep head page's private until post_alloc_hook().
+ *
+ * XXX: Tail pages' private doesn't get cleared.
+ */
}
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -920,7 +925,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
- int migratetype, fpi_t fpi_flags)
+ int migratetype, fpi_t fpi_flags, unsigned short luf_key)
{
struct capture_control *capc = task_capc(zone);
unsigned long buddy_pfn = 0;
@@ -937,10 +942,21 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
+ /*
+ * Use the page's luf_key unchanged if luf_key == 0. Worth
+ * noting that page_luf_key() will be 0 in most cases since it's
+ * initialized at free_pages_prepare().
+ */
+ if (luf_key)
+ set_page_luf_key(page, luf_key);
+ else
+ luf_key = page_luf_key(page);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
+ unsigned short buddy_luf_key;
- if (compaction_capture(capc, page, order, migratetype)) {
+ if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -973,6 +989,18 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ /*
+ * !buddy_luf_key && !luf_key : do nothing
+ * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
+ * !buddy_luf_key && luf_key : do nothing
+ * buddy_luf_key && luf_key : merge two into luf_key
+ */
+ buddy_luf_key = page_luf_key(buddy);
+ if (buddy_luf_key && !luf_key)
+ luf_key = buddy_luf_key;
+ else if (buddy_luf_key && luf_key)
+ fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+
if (unlikely(buddy_mt != migratetype)) {
/*
* Match buddy type. This ensures that an
@@ -984,6 +1012,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
+ set_page_luf_key(page, luf_key);
pfn = combined_pfn;
order++;
}
@@ -1164,6 +1193,11 @@ __always_inline bool free_pages_prepare(struct page *page,
VM_BUG_ON_PAGE(PageTail(page), page);
+ /*
+ * Ensure private is zero before using it inside the allocator.
+ */
+ set_page_private(page, 0);
+
trace_mm_page_free(page, order);
kmsan_free_page(page, order);
@@ -1329,7 +1363,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
}
@@ -1353,7 +1388,7 @@ static void split_large_buddy(struct zone *zone, struct page *page,
do {
int mt = get_pfnblock_migratetype(page, pfn);
- __free_one_page(page, pfn, zone, order, mt, fpi);
+ __free_one_page(page, pfn, zone, order, mt, fpi, 0);
pfn += 1 << order;
if (pfn == end)
break;
@@ -1363,11 +1398,18 @@ static void split_large_buddy(struct zone *zone, struct page *page,
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long flags;
spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * valid luf_key can be passed only if order == 0.
+ */
+ VM_WARN_ON(luf_key && order);
+ set_page_luf_key(page, luf_key);
+
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1375,13 +1417,13 @@ static void free_one_page(struct zone *zone, struct page *page,
}
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
if (free_pages_prepare(page, order))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ free_one_page(zone, page, pfn, order, fpi_flags, luf_key);
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -1429,7 +1471,7 @@ void __meminit __free_pages_core(struct page *page, unsigned int order,
* Bypass PCP and place fresh pages right to the tail, primarily
* relevant for memory onlining.
*/
- __free_pages_ok(page, order, FPI_TO_TAIL);
+ __free_pages_ok(page, order, FPI_TO_TAIL, 0);
}
/*
@@ -2426,6 +2468,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (unlikely(page == NULL))
break;
+ /*
+ * Keep the page's luf_key.
+ */
+
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -2707,12 +2753,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
static void free_frozen_page_commit(struct zone *zone,
struct per_cpu_pages *pcp, struct page *page, int migratetype,
- unsigned int order)
+ unsigned int order, unsigned short luf_key)
{
int high, batch;
int pindex;
bool free_high = false;
+ set_page_luf_key(page, luf_key);
+
/*
* On freeing, reduce the number of pages that are batch allocated.
* See nr_pcp_alloc() where alloc_factor is increased for subsequent
@@ -2721,7 +2769,16 @@ static void free_frozen_page_commit(struct zone *zone,
pcp->alloc_factor >>= 1;
__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
- list_add(&page->pcp_list, &pcp->lists[pindex]);
+
+ /*
+ * Defer tlb shootdown as much as possible by putting luf'd
+ * pages at the tail.
+ */
+ if (luf_key)
+ list_add_tail(&page->pcp_list, &pcp->lists[pindex]);
+ else
+ list_add(&page->pcp_list, &pcp->lists[pindex]);
+
pcp->count += 1 << order;
batch = READ_ONCE(pcp->batch);
@@ -2756,7 +2813,8 @@ static void free_frozen_page_commit(struct zone *zone,
/*
* Free a pcp page
*/
-void free_frozen_pages(struct page *page, unsigned int order)
+void free_frozen_pages(struct page *page, unsigned int order,
+ unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
@@ -2765,7 +2823,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
int migratetype;
if (!pcp_allowed_order(order)) {
- __free_pages_ok(page, order, FPI_NONE);
+ __free_pages_ok(page, order, FPI_NONE, luf_key);
return;
}
@@ -2783,7 +2841,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE, luf_key);
return;
}
migratetype = MIGRATE_MOVABLE;
@@ -2792,10 +2850,10 @@ void free_frozen_pages(struct page *page, unsigned int order)
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_frozen_page_commit(zone, pcp, page, migratetype, order);
+ free_frozen_page_commit(zone, pcp, page, migratetype, order, luf_key);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE, luf_key);
}
pcp_trylock_finish(UP_flags);
}
@@ -2803,7 +2861,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
/*
* Free a batch of folios
*/
-void free_unref_folios(struct folio_batch *folios)
+void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp = NULL;
@@ -2824,7 +2882,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (!pcp_allowed_order(order)) {
free_one_page(folio_zone(folio), &folio->page,
- pfn, order, FPI_NONE);
+ pfn, order, FPI_NONE, luf_key);
continue;
}
folio->private = (void *)(unsigned long)order;
@@ -2860,7 +2918,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (is_migrate_isolate(migratetype)) {
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
@@ -2873,7 +2931,7 @@ void free_unref_folios(struct folio_batch *folios)
if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
locked_zone = zone;
@@ -2888,7 +2946,7 @@ void free_unref_folios(struct folio_batch *folios)
trace_mm_page_free_batched(&folio->page);
free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
- order);
+ order, luf_key);
}
if (pcp) {
@@ -2980,7 +3038,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
/* Return isolated page to tail of freelist. */
__free_one_page(page, page_to_pfn(page), zone, order, mt,
- FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
+ FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL, 0);
}
/*
@@ -4866,7 +4924,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
out:
if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
- free_frozen_pages(page, order);
+ free_frozen_pages(page, order, 0);
page = NULL;
}
@@ -4947,11 +5005,11 @@ void __free_pages(struct page *page, unsigned int order)
struct alloc_tag *tag = pgalloc_tag_get(page);
if (put_page_testzero(page))
- free_frozen_pages(page, order);
+ free_frozen_pages(page, order, 0);
else if (!head) {
pgalloc_tag_sub_pages(tag, (1 << order) - 1);
while (order-- > 0)
- free_frozen_pages(page + (1 << order), order);
+ free_frozen_pages(page + (1 << order), order, 0);
}
}
EXPORT_SYMBOL(__free_pages);
@@ -4982,7 +5040,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
last = page + (1UL << order);
for (page += nr; page < last; page++)
- __free_pages_ok(page, 0, FPI_TO_TAIL);
+ __free_pages_ok(page, 0, FPI_TO_TAIL, 0);
}
return (void *)addr;
}
@@ -7000,7 +7058,7 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
- __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+ __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
}
@@ -7069,7 +7127,7 @@ static void __accept_page(struct zone *zone, unsigned long *flags,
accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER);
- __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL);
+ __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL, 0);
if (last)
static_branch_dec(&zones_with_unaccepted_pages);
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e4..558622f15a81e 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -86,7 +86,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
if (page_ref_sub_and_test(page, count))
- free_frozen_pages(page, compound_order(page));
+ free_frozen_pages(page, compound_order(page), 0);
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -139,7 +139,7 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
if (unlikely(encoded_page_decode_pfmemalloc(encoded_page))) {
free_frozen_pages(page,
- encoded_page_decode_order(encoded_page));
+ encoded_page_decode_order(encoded_page), 0);
goto refill;
}
@@ -166,6 +166,6 @@ void page_frag_free(void *addr)
struct page *page = virt_to_head_page(addr);
if (unlikely(put_page_testzero(page)))
- free_frozen_pages(page, compound_order(page));
+ free_frozen_pages(page, compound_order(page), 0);
}
EXPORT_SYMBOL(page_frag_free);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c608e9d728655..04dcea88a0dda 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -258,6 +258,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
+
+ /*
+ * Do not clear the page's private to keep its luf_key
+ * unchanged.
+ */
+
__putback_isolated_page(page, order, migratetype);
}
zone->nr_isolate_pageblock--;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e61d8c1..c05afb7a395f1 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -116,6 +116,12 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
int mt = get_pageblock_migratetype(page);
unsigned int order = get_order(sg->length);
+ /*
+ * Ensure private is zero before putting the page back
+ * into the allocator.
+ */
+ set_page_private(page, 0);
+
__putback_isolated_page(page, order, mt);
/* If the pages were not reported due to error skip flagging */
diff --git a/mm/slub.c b/mm/slub.c
index 1f50129dcfb3c..2cc3bf0f58bce 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2652,7 +2652,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
__folio_clear_slab(folio);
mm_account_reclaimed_pages(pages);
unaccount_slab(slab, order, s);
- free_frozen_pages(&folio->page, order);
+ free_frozen_pages(&folio->page, order, 0);
}
static void rcu_free_slab(struct rcu_head *h)
diff --git a/mm/swap.c b/mm/swap.c
index fc8281ef42415..0c6198e4a8ee4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -109,7 +109,7 @@ void __folio_put(struct folio *folio)
page_cache_release(folio);
folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
- free_frozen_pages(&folio->page, folio_order(folio));
+ free_frozen_pages(&folio->page, folio_order(folio), 0);
}
EXPORT_SYMBOL(__folio_put);
@@ -991,7 +991,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
folios->nr = j;
mem_cgroup_uncharge_folios(folios);
- free_unref_folios(folios);
+ free_unref_folios(folios, 0);
}
EXPORT_SYMBOL(folios_put_refs);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7d..ff1c53e769398 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1515,7 +1515,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
}
continue;
@@ -1584,7 +1584,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
@@ -1908,7 +1908,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
@@ -1930,7 +1930,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (free_folios.nr) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 12/25] mm: delimit critical sections to take off pages from pcp or buddy allocator
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (9 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 13/25] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
` (12 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Now that the luf mechanism has been introduced, tlb shootdown might be
necessary when luf'd pages exit from the pcp or buddy allocator. Check
whether it's okay to take pages off and whether tlb shootdown can be
performed for luf'd pages before they are used.
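For illustration only, and not part of this patch, a minimal sketch of how
a pcp path gets wrapped, mirroring what this patch does in
rmqueue_pcplist(); pcp_take_example() is a hypothetical placeholder and the
pcp locking is elided.

static struct page *pcp_take_example(struct per_cpu_pages *pcp,
				     struct list_head *list)
{
	struct page *page;

	/* The whole pcp take-off is one luf critical section. */
	luf_takeoff_start();

	page = list_first_entry_or_null(list, struct page, pcp_list);
	if (page && luf_takeoff_check_and_fold(page)) {
		list_del(&page->pcp_list);
		pcp->count--;
	} else {
		page = NULL;
	}

	/* Check and flush before the caller uses the page taken off. */
	luf_takeoff_end();

	return page;
}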
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 32 ++++++++++++++++--
mm/internal.h | 2 +-
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++--
mm/page_isolation.c | 4 ++-
mm/page_reporting.c | 20 +++++++++++-
5 files changed, 129 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 12ed8425fa175..e26736d5b7b9c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -606,6 +606,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
+ luf_takeoff_start();
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -654,9 +655,12 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
+ if (!luf_takeoff_check(page))
+ goto isolate_fail;
+
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, false);
if (!isolated)
break;
set_page_private(page, order);
@@ -684,6 +688,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/*
* Be careful to not go outside of the pageblock.
*/
@@ -1591,6 +1600,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
+ luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
@@ -1598,6 +1608,10 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
+
+ if (!luf_takeoff_check(freepage))
+ goto scan_next;
+
pfn = page_to_pfn(freepage);
if (pfn >= highest)
@@ -1617,7 +1631,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
-
+scan_next:
if (order_scanned >= limit)
break;
}
@@ -1635,7 +1649,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, false)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -1652,6 +1666,11 @@ static void fast_isolate_freepages(struct compact_control *cc)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
break;
@@ -2372,7 +2391,14 @@ static enum compact_result compact_finished(struct compact_control *cc)
{
int ret;
+ /*
+ * luf_takeoff_{start,end}() is required to identify whether
+ * this compaction context is tlb shootdownable for luf'd pages.
+ */
+ luf_takeoff_start();
ret = __compact_finished(cc);
+ luf_takeoff_end();
+
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index 47d3291278e81..9426ff6346d44 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -664,7 +664,7 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order, bool willputback);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2d23bbd60467..325f07c34cfdc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -869,8 +869,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
static inline struct page *get_page_from_free_area(struct free_area *area,
int migratetype)
{
- return list_first_entry_or_null(&area->free_list[migratetype],
+ struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ return NULL;
}
/*
@@ -1575,6 +1580,8 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -1945,6 +1952,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
return true;
@@ -1956,6 +1970,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, order,
get_pfnblock_migratetype(page, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, page, pfn, order, FPI_NONE);
return true;
@@ -2088,6 +2109,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
account_freepages(zone, nr_added, start_type);
@@ -2168,6 +2191,9 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
+ if (luf_takeoff_no_shootdown())
+ continue;
+
if (can_steal_fallback(order, migratetype))
*can_steal = true;
@@ -2259,6 +2285,11 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
pageblock_nr_pages)
continue;
+ /*
+ * luf_takeoff_{start,end}() is required for
+ * get_page_from_free_area() to use luf_takeoff_check().
+ */
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2316,10 +2347,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
return false;
@@ -2461,6 +2494,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2485,6 +2519,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add_tail(&page->pcp_list, list);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return i;
}
@@ -2979,7 +3017,7 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2998,6 +3036,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
del_page_from_free_list(page, zone, order, mt);
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -3077,6 +3117,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3094,10 +3135,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return NULL;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3181,6 +3227,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
page = list_first_entry(list, struct page, pcp_list);
+ if (!luf_takeoff_check_and_fold(page))
+ return NULL;
list_del(&page->pcp_list);
pcp->count -= 1 << order;
} while (check_new_pages(page, order));
@@ -3198,11 +3246,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
return NULL;
}
@@ -3216,6 +3266,10 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -4814,6 +4868,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -4849,6 +4904,10 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -4858,6 +4917,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -6912,6 +6972,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -6940,9 +7001,15 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return end_pfn - start_pfn - already_offline;
}
@@ -7018,6 +7085,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
@@ -7030,6 +7098,8 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
@@ -7040,6 +7110,11 @@ bool take_page_off_buddy(struct page *page)
break;
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return ret;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 04dcea88a0dda..c34659b58ca6c 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -211,6 +211,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
@@ -229,7 +230,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order, true);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
@@ -269,6 +270,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
static inline struct page *
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index c05afb7a395f1..03a7f5f6dc073 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -167,6 +167,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (list_empty(list))
return err;
+ luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -191,6 +192,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
+ if (!luf_takeoff_check(page)) {
+ VM_WARN_ON(1);
+ continue;
+ }
+
/*
* If we fully consumed our budget then update our
* state to indicate that we are requesting additional
@@ -204,7 +210,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, false)) {
next = page;
break;
}
@@ -227,6 +233,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -236,6 +247,8 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
+ luf_takeoff_start();
+
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
@@ -259,6 +272,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
return err;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 13/25] mm: introduce pend_list in struct free_area to track luf'd pages
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (10 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 12/25] mm: delimit critical sections to take off pages from pcp or buddy allocator Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
` (11 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
luf'd pages require tlb shootdown on exiting from the page allocator. For
some page allocation requests, it's okay to return a luf'd page followed by
tlb shootdown, but it's not okay for e.g. irq context.
This patch splits the list in free_area into two, 'free_list' for
non-luf'd pages and 'pend_list' for luf'd pages, so that the buddy
allocator can work better under various context conditions.
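For illustration only, and not part of this patch, a minimal sketch of the
list selection that get_page_from_free_area() ends up doing once both lists
exist; pick_free_page_example() is a hypothetical placeholder and the
fallback order is simplified.

static struct page *pick_free_page_example(struct zone *zone,
					   struct free_area *area, int mt)
{
	struct page *page;

	/*
	 * Prefer pend_list when non-luf pages are running short, so the
	 * cheap pages are preserved for contexts that cannot perform
	 * tlb shootdown; otherwise prefer free_list to keep deferring
	 * the flush.
	 */
	if (!non_luf_pages_ok(zone)) {
		page = list_first_entry_or_null(&area->pend_list[mt],
						struct page, buddy_list);
		if (page && luf_takeoff_check(page))
			return page;
	}

	page = list_first_entry_or_null(&area->free_list[mt],
					struct page, buddy_list);
	if (page)
		return page;

	page = list_first_entry_or_null(&area->pend_list[mt],
					struct page, buddy_list);
	if (page && luf_takeoff_check(page))
		return page;

	return NULL;
}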
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mmzone.h | 3 ++
kernel/power/snapshot.c | 14 ++++++
kernel/vmcore_info.c | 2 +
mm/compaction.c | 33 ++++++++++---
mm/internal.h | 17 ++++++-
mm/mm_init.c | 2 +
mm/page_alloc.c | 105 ++++++++++++++++++++++++++++++++++------
mm/page_reporting.c | 22 ++++++---
mm/vmstat.c | 15 ++++++
9 files changed, 184 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da6..e2c8d7147e361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ extern int page_group_by_mobility_disabled;
MIGRATETYPE_MASK)
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
+ struct list_head pend_list[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -1014,6 +1015,8 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+ /* Count pages that need tlb shootdown on allocation */
+ atomic_long_t nr_luf_pages;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index c9fb559a63993..ca10796855aba 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1285,6 +1285,20 @@ static void mark_free_pages(struct zone *zone)
swsusp_set_page_free(pfn_to_page(pfn + i));
}
}
+
+ list_for_each_entry(page,
+ &zone->free_area[order].pend_list[t], buddy_list) {
+ unsigned long i;
+
+ pfn = page_to_pfn(page);
+ for (i = 0; i < (1UL << order); i++) {
+ if (!--page_count) {
+ touch_nmi_watchdog();
+ page_count = WD_PAGE_COUNT;
+ }
+ swsusp_set_page_free(pfn_to_page(pfn + i));
+ }
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index 1fec61603ef32..638deb57f9ddd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -188,11 +188,13 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(zone, vm_stat);
VMCOREINFO_OFFSET(zone, spanned_pages);
VMCOREINFO_OFFSET(free_area, free_list);
+ VMCOREINFO_OFFSET(free_area, pend_list);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_LENGTH(zone.free_area, NR_PAGE_ORDERS);
log_buf_vmcoreinfo_setup();
VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.pend_list, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
diff --git a/mm/compaction.c b/mm/compaction.c
index e26736d5b7b9c..aa594a85d8aee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1592,24 +1592,28 @@ static void fast_isolate_freepages(struct compact_control *cc)
order = next_search_order(cc, order)) {
struct free_area *area = &cc->zone->free_area[order];
struct list_head *freelist;
+ struct list_head *high_pfn_list;
struct page *freepage;
unsigned long flags;
unsigned int order_scanned = 0;
unsigned long high_pfn = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
if (!area->nr_free)
continue;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
unsigned long pfn;
order_scanned++;
nr_scanned++;
- if (!luf_takeoff_check(freepage))
+ if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1622,26 +1626,34 @@ static void fast_isolate_freepages(struct compact_control *cc)
cc->fast_search_fail = 0;
cc->search_order = order;
page = freepage;
- break;
+ goto done;
}
if (pfn >= min_pfn && pfn > high_pfn) {
high_pfn = pfn;
+ high_pfn_list = freelist;
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
scan_next:
if (order_scanned >= limit)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
/* Use a maximum candidate pfn if a preferred one was not found */
if (!page && high_pfn) {
page = pfn_to_page(high_pfn);
/* Update freepage for the list reorder below */
freepage = page;
+ freelist = high_pfn_list;
}
/* Reorder to so a future search skips recent pages */
@@ -2039,18 +2051,20 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
struct list_head *freelist;
unsigned long flags;
struct page *freepage;
+ bool consider_pend = false;
if (!area->nr_free)
continue;
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry(freepage, freelist, buddy_list) {
unsigned long free_pfn;
if (nr_scanned++ >= limit) {
move_freelist_tail(freelist, freepage);
- break;
+ goto done;
}
free_pfn = page_to_pfn(freepage);
@@ -2073,9 +2087,16 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
- break;
+ goto done;
}
}
+
+ if (!consider_pend) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
spin_unlock_irqrestore(&cc->zone->lock, flags);
}
diff --git a/mm/internal.h b/mm/internal.h
index 9426ff6346d44..9dbb65f919337 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -849,11 +849,16 @@ void init_cma_reserved_pageblock(struct page *page);
int find_suitable_fallback(struct free_area *area, unsigned int order,
int migratetype, bool only_stealable, bool *can_steal);
-static inline bool free_area_empty(struct free_area *area, int migratetype)
+static inline bool free_list_empty(struct free_area *area, int migratetype)
{
return list_empty(&area->free_list[migratetype]);
}
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+ return list_empty(&area->free_list[migratetype]) &&
+ list_empty(&area->pend_list[migratetype]);
+}
/* mm/util.c */
struct anon_vma *folio_anon_vma(const struct folio *folio);
@@ -1587,12 +1592,22 @@ void luf_takeoff_end(void);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct page *page);
bool luf_takeoff_check_and_fold(struct page *page);
+
+static inline bool non_luf_pages_ok(struct zone *zone)
+{
+ unsigned long nr_free = zone_page_state(zone, NR_FREE_PAGES);
+ unsigned long min_wm = min_wmark_pages(zone);
+ unsigned long nr_luf_pages = atomic_long_read(&zone->nr_luf_pages);
+
+ return nr_free - nr_luf_pages > min_wm;
+}
#else
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
#endif
/* pagewalk.c */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2630cc30147e0..41c38fbb58a30 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1399,12 +1399,14 @@ static void __meminit zone_init_free_lists(struct zone *zone)
unsigned int order, t;
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
zone->free_area[order].nr_free = 0;
}
#ifdef CONFIG_UNACCEPTED_MEMORY
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
+ atomic_long_set(&zone->nr_luf_pages, 0);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 325f07c34cfdc..db1460c07b964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -804,15 +804,28 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
struct free_area *area = &zone->free_area[order];
+ struct list_head *list;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
+ /*
+ * When identifying whether a page requires tlb shootdown, a
+ * false positive is okay because it will just cause an
+ * additional tlb shootdown.
+ */
+ if (page_luf_key(page)) {
+ list = &area->pend_list[migratetype];
+ atomic_long_add(1 << order, &zone->nr_luf_pages);
+ } else
+ list = &area->free_list[migratetype];
+
if (tail)
- list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ list_add_tail(&page->buddy_list, list);
else
- list_add(&page->buddy_list, &area->free_list[migratetype]);
+ list_add(&page->buddy_list, list);
+
area->nr_free++;
}
@@ -831,7 +844,20 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), old_mt, 1 << order);
- list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+ /*
+ * The page might have been taken from a pfn where it's not
+ * clear which list was used. Therefore, conservatively
+ * consider it as being on pend_list, so as not to miss any
+ * pages that truly require tlb shootdown.
+ *
+ * When identifying whether a page requires tlb shootdown, a
+ * false positive is okay because it will just cause an
+ * additional tlb shootdown.
+ */
+ if (page_luf_key(page))
+ list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
+ else
+ list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
account_freepages(zone, 1 << order, new_mt);
@@ -848,6 +874,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
+ if (page_luf_key(page))
+ atomic_long_sub(1 << order, &zone->nr_luf_pages);
+
list_del(&page->buddy_list);
__ClearPageBuddy(page);
zone->free_area[order].nr_free--;
@@ -866,15 +895,48 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
account_freepages(zone, -(1 << order), migratetype);
}
-static inline struct page *get_page_from_free_area(struct free_area *area,
- int migratetype)
+static inline struct page *get_page_from_free_area(struct zone *zone,
+ struct free_area *area, int migratetype)
{
- struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
- struct page, buddy_list);
+ struct page *page;
+ bool pend_first;
- if (page && luf_takeoff_check(page))
- return page;
+ /*
+ * XXX: Make the decision more precise if needed, e.g. using
+ * zone_watermark_ok() or its family, but for now, don't want to
+ * make it heavier.
+ *
+ * Try free_list, holding non-luf pages, first if there are
+ * enough non-luf pages to aggressively defer tlb flush, but
+ * try pend_list first instead if not.
+ */
+ pend_first = !non_luf_pages_ok(zone);
+
+ if (pend_first) {
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ } else {
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+ }
return NULL;
}
@@ -1027,6 +1089,8 @@ static inline void __free_one_page(struct page *page,
if (fpi_flags & FPI_TO_TAIL)
to_tail = true;
+ else if (page_luf_key(page))
+ to_tail = true;
else if (is_shuffle_order(order))
to_tail = shuffle_pick_tail();
else
@@ -1552,6 +1616,8 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
unsigned int nr_added = 0;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -1565,7 +1631,10 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- __add_to_free_list(&page[size], zone, high, migratetype, false);
+ if (page_luf_key(&page[size]))
+ tail = true;
+
+ __add_to_free_list(&page[size], zone, high, migratetype, tail);
set_buddy_order(&page[size], high);
nr_added += size;
}
@@ -1749,7 +1818,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
area = &(zone->free_area[current_order]);
- page = get_page_from_free_area(area, migratetype);
+ page = get_page_from_free_area(zone, area, migratetype);
if (!page)
continue;
@@ -2191,7 +2260,8 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
- if (luf_takeoff_no_shootdown())
+ if (free_list_empty(area, fallback_mt) &&
+ luf_takeoff_no_shootdown())
continue;
if (can_steal_fallback(order, migratetype))
@@ -2295,7 +2365,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
struct free_area *area = &(zone->free_area[order]);
int mt;
- page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+ page = get_page_from_free_area(zone, area, MIGRATE_HIGHATOMIC);
if (!page)
continue;
@@ -2433,7 +2503,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
VM_BUG_ON(current_order > MAX_PAGE_ORDER);
do_steal:
- page = get_page_from_free_area(area, fallback_mt);
+ page = get_page_from_free_area(zone, area, fallback_mt);
/* take off list, maybe claim block, expand remainder */
page = steal_suitable_fallback(zone, page, current_order, order,
@@ -7056,6 +7126,8 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
struct page *current_buddy;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
@@ -7069,7 +7141,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- add_to_free_list(current_buddy, zone, high, migratetype, false);
+ if (page_luf_key(current_buddy))
+ tail = true;
+
+ add_to_free_list(current_buddy, zone, high, migratetype, tail);
set_buddy_order(current_buddy, high);
}
}
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 03a7f5f6dc073..e152b22fbba8a 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -159,15 +159,17 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
struct page *page, *next;
long budget;
int err = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
/*
* Perform early check, if free area is empty there is
* nothing to process so we can skip this free_list.
*/
- if (list_empty(list))
+ if (free_area_empty(area, mt))
return err;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -185,14 +187,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
* should always be a power of 2.
*/
budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
+retry:
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
/* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- if (!luf_takeoff_check(page)) {
+ if (unlikely(consider_pend && !luf_takeoff_check(page))) {
VM_WARN_ON(1);
continue;
}
@@ -205,14 +207,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
- break;
+ goto done;
}
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order, false)) {
next = page;
- break;
+ goto done;
}
/* Add page to scatter list */
@@ -263,9 +265,15 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* exit on error */
if (err)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ list = &area->pend_list[mt];
+ goto retry;
+ }
+done:
/* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd4..5ae5ac9f0a4a9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1581,6 +1581,21 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
break;
}
}
+ list_for_each(curr, &area->pend_list[mtype]) {
+ /*
+ * Cap the pend_list iteration because it might
+ * be really large and we are under a spinlock
+ * so a long time spent here could trigger a
+ * hard lockup detector. Anyway this is a
+ * debugging tool so knowing there is a handful
+ * of pages of this order should be more than
+ * sufficient.
+ */
+ if (++freecount >= 100000) {
+ overflow = true;
+ break;
+ }
+ }
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
spin_unlock_irq(&zone->lock);
cond_resched();
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (11 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 13/25] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
` (10 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which requires recognizing read-only tlb entries and handling them in a
different way. The separate read-only batch introduced in this patch,
tlb_ubc_ro, will be used by the luf mechanism.
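As a rough illustration only, the following userspace sketch shows what
folding one batch into the other amounts to: the pending state of the
source batch is merged into the destination and the source is consumed.
The struct layout and the bitmask standing in for a cpumask are
simplified assumptions, not the kernel implementation.
#include <stdbool.h>
#include <stdio.h>

struct tlbflush_unmap_batch {
        unsigned long cpumask;          /* stand-in for a real cpumask */
        bool flush_required;
        bool writable;
};

static void fold_batch(struct tlbflush_unmap_batch *dst,
                       struct tlbflush_unmap_batch *src)
{
        if (!src->flush_required)
                return;

        /* Merge the pending CPUs and flags, then consume the source. */
        dst->cpumask |= src->cpumask;
        dst->flush_required = true;
        dst->writable |= src->writable;
        *src = (struct tlbflush_unmap_batch){ 0 };
}

int main(void)
{
        struct tlbflush_unmap_batch tlb_ubc    = { 0x1, true, false };
        struct tlbflush_unmap_batch tlb_ubc_ro = { 0x6, true, false };

        fold_batch(&tlb_ubc, &tlb_ubc_ro);      /* done before a real flush */
        printf("cpus to flush: 0x%lx\n", tlb_ubc.cpumask);
        return 0;
}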
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/rmap.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a3049ea5b3ad3..d1a3c97491ff2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,6 +1407,7 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
+ struct tlbflush_unmap_batch tlb_ubc_ro;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1581b1a00f974..3ed6234dd777e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -775,6 +775,7 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -789,6 +790,9 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc);
+ if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_ro);
+
reset_batch(tlb_ubc_takeoff);
}
@@ -801,7 +805,9 @@ void try_to_unmap_flush_takeoff(void)
void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
if (!tlb_ubc->flush_required)
return;
@@ -813,8 +819,9 @@ void try_to_unmap_flush(void)
void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
- if (tlb_ubc->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable)
try_to_unmap_flush();
}
@@ -831,13 +838,18 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr)
{
- struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc;
int batch;
bool writable = pte_dirty(pteval);
if (!pte_accessible(mm, pteval))
return;
+ if (pte_write(pteval))
+ tlb_ubc = ¤t->tlb_ubc;
+ else
+ tlb_ubc = ¤t->tlb_ubc_ro;
+
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (12 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped Byungchul Park
` (9 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which needs a hook on page cache updates so that, for pages that might
have been mapped in any task, the required tlb flush can be performed.
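As a rough illustration only (not the kernel API), the pattern this
refactoring prepares boils down to routing every caller through one
wrapper, so that a later hook has a single place to live. All types and
names below, other than write_begin/write_end, are simplified
assumptions.
#include <stdio.h>

struct mapping_ops {
        int (*write_begin)(long pos, unsigned int len);
        int (*write_end)(long pos, unsigned int len, unsigned int copied);
};

struct mapping {
        const struct mapping_ops *a_ops;
};

/* Single choke point: the spot where a tlb-flush hook can later go. */
static int mapping_write_begin(struct mapping *m, long pos, unsigned int len)
{
        return m->a_ops->write_begin(pos, len);
}

static int mapping_write_end(struct mapping *m, long pos, unsigned int len,
                             unsigned int copied)
{
        return m->a_ops->write_end(pos, len, copied);
}

/* Dummy backend standing in for a real filesystem's implementation. */
static int demo_write_begin(long pos, unsigned int len) { return 0; }
static int demo_write_end(long pos, unsigned int len, unsigned int copied)
{
        return (int)copied;
}

static const struct mapping_ops demo_ops = {
        .write_begin = demo_write_begin,
        .write_end   = demo_write_end,
};

int main(void)
{
        struct mapping m = { .a_ops = &demo_ops };

        if (!mapping_write_begin(&m, 0, 4096))
                printf("copied %d bytes\n",
                       mapping_write_end(&m, 0, 4096, 4096));
        return 0;
}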
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 ++++-------
fs/affs/file.c | 4 ++--
fs/buffer.c | 14 ++++++--------
fs/exfat/file.c | 5 ++---
fs/ext4/verity.c | 5 ++---
fs/f2fs/super.c | 5 ++---
fs/f2fs/verity.c | 5 ++---
fs/namei.c | 5 ++---
include/linux/fs.h | 18 ++++++++++++++++++
mm/filemap.c | 5 ++---
10 files changed, 42 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index ae3343c81a645..22ce009d13689 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -418,7 +418,6 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
const struct drm_i915_gem_pwrite *arg)
{
struct address_space *mapping = obj->base.filp->f_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
char __user *user_data = u64_to_user_ptr(arg->data_ptr);
u64 remain;
loff_t pos;
@@ -477,7 +476,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
if (err)
return err;
- err = aops->write_begin(obj->base.filp, mapping, pos, len,
+ err = mapping_write_begin(obj->base.filp, mapping, pos, len,
&folio, &data);
if (err < 0)
return err;
@@ -488,7 +487,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
pagefault_enable();
kunmap_local(vaddr);
- err = aops->write_end(obj->base.filp, mapping, pos, len,
+ err = mapping_write_end(obj->base.filp, mapping, pos, len,
len - unwritten, folio, data);
if (err < 0)
return err;
@@ -654,7 +653,6 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
{
struct drm_i915_gem_object *obj;
struct file *file;
- const struct address_space_operations *aops;
loff_t pos;
int err;
@@ -666,21 +664,20 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
GEM_BUG_ON(obj->write_domain != I915_GEM_DOMAIN_CPU);
file = obj->base.filp;
- aops = file->f_mapping->a_ops;
pos = 0;
do {
unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
struct folio *folio;
void *fsdata;
- err = aops->write_begin(file, file->f_mapping, pos, len,
+ err = mapping_write_begin(file, file->f_mapping, pos, len,
&folio, &fsdata);
if (err < 0)
goto fail;
memcpy_to_folio(folio, offset_in_folio(folio, pos), data, len);
- err = aops->write_end(file, file->f_mapping, pos, len, len,
+ err = mapping_write_end(file, file->f_mapping, pos, len, len,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/fs/affs/file.c b/fs/affs/file.c
index a5a861dd52230..10e7f53828e93 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -885,9 +885,9 @@ affs_truncate(struct inode *inode)
loff_t isize = inode->i_size;
int res;
- res = mapping->a_ops->write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
if (!res)
- res = mapping->a_ops->write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
else
inode->i_size = AFFS_I(inode)->mmu_private;
mark_inode_dirty(inode);
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f602516..f54fce7729bf1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2456,7 +2456,6 @@ EXPORT_SYMBOL(block_read_full_folio);
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
struct folio *folio;
void *fsdata = NULL;
int err;
@@ -2465,11 +2464,11 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size)
if (err)
goto out;
- err = aops->write_begin(NULL, mapping, size, 0, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, size, 0, &folio, &fsdata);
if (err)
goto out;
- err = aops->write_end(NULL, mapping, size, 0, 0, folio, fsdata);
+ err = mapping_write_end(NULL, mapping, size, 0, 0, folio, fsdata);
BUG_ON(err > 0);
out:
@@ -2481,7 +2480,6 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
loff_t pos, loff_t *bytes)
{
struct inode *inode = mapping->host;
- const struct address_space_operations *aops = mapping->a_ops;
unsigned int blocksize = i_blocksize(inode);
struct folio *folio;
void *fsdata = NULL;
@@ -2501,12 +2499,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = PAGE_SIZE - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
@@ -2534,12 +2532,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = offset - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
diff --git a/fs/exfat/file.c b/fs/exfat/file.c
index 05b51e7217838..9a1002761f79f 100644
--- a/fs/exfat/file.c
+++ b/fs/exfat/file.c
@@ -539,7 +539,6 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
struct inode *inode = file_inode(file);
struct exfat_inode_info *ei = EXFAT_I(inode);
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *ops = mapping->a_ops;
pos = ei->valid_size;
while (pos < new_valid_size) {
@@ -551,14 +550,14 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
if (pos + len > new_valid_size)
len = new_valid_size - pos;
- err = ops->write_begin(file, mapping, pos, len, &folio, NULL);
+ err = mapping_write_begin(file, mapping, pos, len, &folio, NULL);
if (err)
goto out;
off = offset_in_folio(folio, pos);
folio_zero_new_buffers(folio, off, off + len);
- err = ops->write_end(file, mapping, pos, len, len, folio, NULL);
+ err = mapping_write_end(file, mapping, pos, len, len, folio, NULL);
if (err < 0)
goto out;
pos += len;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index d9203228ce979..64fa43f80c73e 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -68,7 +68,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > inode->i_sb->s_maxbytes)
return -EFBIG;
@@ -80,13 +79,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 19b67828ae325..87c26f0571dab 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2710,7 +2710,6 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
{
struct inode *inode = sb_dqopt(sb)->files[type];
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
int offset = off & (sb->s_blocksize - 1);
size_t towrite = len;
struct folio *folio;
@@ -2722,7 +2721,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
tocopy = min_t(unsigned long, sb->s_blocksize - offset,
towrite);
retry:
- err = a_ops->write_begin(NULL, mapping, off, tocopy,
+ err = mapping_write_begin(NULL, mapping, off, tocopy,
&folio, &fsdata);
if (unlikely(err)) {
if (err == -ENOMEM) {
@@ -2735,7 +2734,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
memcpy_to_folio(folio, offset_in_folio(folio, off), data, tocopy);
- a_ops->write_end(NULL, mapping, off, tocopy, tocopy,
+ mapping_write_end(NULL, mapping, off, tocopy, tocopy,
folio, fsdata);
offset = 0;
towrite -= tocopy;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 2287f238ae09e..b232589546d39 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -72,7 +72,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > F2FS_BLK_TO_BYTES(max_file_blocks(inode)))
return -EFBIG;
@@ -84,13 +83,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/namei.c b/fs/namei.c
index 3ab9440c5b931..e1c6d28c560da 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5409,7 +5409,6 @@ EXPORT_SYMBOL(page_readlink);
int page_symlink(struct inode *inode, const char *symname, int len)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
bool nofs = !mapping_gfp_constraint(mapping, __GFP_FS);
struct folio *folio;
void *fsdata = NULL;
@@ -5419,7 +5418,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
retry:
if (nofs)
flags = memalloc_nofs_save();
- err = aops->write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
if (nofs)
memalloc_nofs_restore(flags);
if (err)
@@ -5427,7 +5426,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
memcpy(folio_address(folio), symname, len - 1);
- err = aops->write_end(NULL, mapping, 0, len - 1, len - 1,
+ err = mapping_write_end(NULL, mapping, 0, len - 1, len - 1,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c3b2f8a621f7..820ff4752249e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -531,6 +531,24 @@ struct address_space {
#define PAGECACHE_TAG_WRITEBACK XA_MARK_1
#define PAGECACHE_TAG_TOWRITE XA_MARK_2
+static inline int mapping_write_begin(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len,
+ struct folio **foliop, void **fsdata)
+{
+ return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ fsdata);
+}
+
+static inline int mapping_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct folio *folio, void *fsdata)
+{
+ return mapping->a_ops->write_end(file, mapping, pos, len, copied,
+ folio, fsdata);
+}
+
/*
* Returns true if any of the pages in the mapping are marked with the tag.
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index 804d7365680c1..6a1f90ddaaf08 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4152,7 +4152,6 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
struct file *file = iocb->ki_filp;
loff_t pos = iocb->ki_pos;
struct address_space *mapping = file->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
size_t chunk = mapping_max_folio_size(mapping);
long status = 0;
ssize_t written = 0;
@@ -4186,7 +4185,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
break;
}
- status = a_ops->write_begin(file, mapping, pos, bytes,
+ status = mapping_write_begin(file, mapping, pos, bytes,
&folio, &fsdata);
if (unlikely(status < 0))
break;
@@ -4201,7 +4200,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
flush_dcache_folio(folio);
- status = a_ops->write_end(file, mapping, pos, bytes, copied,
+ status = mapping_write_end(file, mapping, pos, bytes, copied,
folio, fsdata);
if (unlikely(status != copied)) {
iov_iter_revert(i, copied - max(status, 0L));
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (13 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
` (8 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, as
long as the contents of the folios don't change while staying in pcp or
buddy, so we can still read the data through the stale tlb entries.
tlb flush can be deferred when folios get unmapped, as long as the
needed tlb flush is guaranteed to be performed before the folios
actually become used, and of course only if none of the corresponding
ptes have write permission. Otherwise, the system will get messed up.
To achieve that, for the folios that map only to non-writable tlb
entries, prevent tlb flush during unmapping but perform it just before
the folios actually become used, out of buddy or pcp.
However, the flush pending by LUF should be canceled and the deferred
TLB flush performed right away, as sketched below, when:
1. a writable pte is newly set through fault handler
2. a file is updated
3. kasan needs poisoning on free
4. the kernel wants to init pages on free
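As a rough illustration only, a minimal userspace sketch of the
resulting lifecycle follows; every name and type here is a simplified
assumption, not the kernel code. A read-only unmap defers the flush,
the deferred flush is performed when the freed page gets allocated
again, and any of the cancel conditions above forces it earlier.
#include <stdbool.h>
#include <stdio.h>

static bool flush_pending;

static void tlb_flush(void)
{
        flush_pending = false;
        puts("tlb flushed");
}

/* Unmap: defer the flush only when every mapping was read-only. */
static void unmap_folio(bool all_mappings_readonly)
{
        if (all_mappings_readonly)
                flush_pending = true;   /* LUF: no flush yet */
        else
                tlb_flush();            /* normal batched flush */
}

/* Cancel conditions (write fault, file update, ...) flush right away. */
static void luf_flush(void)
{
        if (flush_pending)
                tlb_flush();
}

/* Allocation: flush before the page can be reused with new contents. */
static void allocate_page(void)
{
        luf_flush();
        puts("page handed out");
}

int main(void)
{
        unmap_folio(true);      /* deferred */
        allocate_page();        /* flush happens here, before reuse */
        return 0;
}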
No matter what type of workload is used for performance evaluation, the
result should be positive thanks to the unconditional reduction of tlb
flushes, tlb misses and interrupts. For the test, I picked one of the
most popular and heavy workloads, llama.cpp, an LLM (Large Language
Model) inference engine.
The result depends on memory latency and how often reclaim runs, which
determine the tlb miss overhead and how many times unmapping happens.
In my system, the result shows:
1. tlb shootdown interrupts are reduced by about 97%.
2. The test program runtime is reduced by about 4.5%.
The test environment and the test set are like:
Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
CPU: 1 socket 64 core with hyper thread on
Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
Config: swap off, numa balancing tiering on, demotion enabled
llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
wait
where,
-t: nr of threads, -s: seed used to make the runtime stable,
-n: nr of tokens that determines the runtime, -p: prompt to ask,
-m: LLM model to use.
Run the test set 5 times successively with caches dropped every run via
'echo 3 > /proc/sys/vm/drop_caches'. Each inference prints its runtime
at the end. The results are as follows:
1. Runtime from the output of llama.cpp
BEFORE
------
llama_print_timings: total time = 883450.54 ms / 24 tokens
llama_print_timings: total time = 861665.91 ms / 24 tokens
llama_print_timings: total time = 898079.02 ms / 24 tokens
llama_print_timings: total time = 879897.69 ms / 24 tokens
llama_print_timings: total time = 892360.75 ms / 24 tokens
llama_print_timings: total time = 884587.85 ms / 24 tokens
llama_print_timings: total time = 861023.19 ms / 24 tokens
llama_print_timings: total time = 900022.18 ms / 24 tokens
llama_print_timings: total time = 878771.88 ms / 24 tokens
llama_print_timings: total time = 889027.98 ms / 24 tokens
llama_print_timings: total time = 880783.90 ms / 24 tokens
llama_print_timings: total time = 856475.29 ms / 24 tokens
llama_print_timings: total time = 896842.21 ms / 24 tokens
llama_print_timings: total time = 878883.53 ms / 24 tokens
llama_print_timings: total time = 890122.10 ms / 24 tokens
AFTER
-----
llama_print_timings: total time = 871060.86 ms / 24 tokens
llama_print_timings: total time = 825609.53 ms / 24 tokens
llama_print_timings: total time = 836854.81 ms / 24 tokens
llama_print_timings: total time = 843147.99 ms / 24 tokens
llama_print_timings: total time = 831426.65 ms / 24 tokens
llama_print_timings: total time = 873939.23 ms / 24 tokens
llama_print_timings: total time = 826127.69 ms / 24 tokens
llama_print_timings: total time = 835489.26 ms / 24 tokens
llama_print_timings: total time = 842589.62 ms / 24 tokens
llama_print_timings: total time = 833700.66 ms / 24 tokens
llama_print_timings: total time = 875996.19 ms / 24 tokens
llama_print_timings: total time = 826401.73 ms / 24 tokens
llama_print_timings: total time = 839341.28 ms / 24 tokens
llama_print_timings: total time = 841075.10 ms / 24 tokens
llama_print_timings: total time = 835136.41 ms / 24 tokens
2. tlb shootdowns from 'cat /proc/interrupts'
BEFORE
------
TLB:
80911532 93691786 100296251 111062810 109769109 109862429
108968588 119175230 115779676 118377498 119325266 120300143
124514185 116697222 121068466 118031913 122660681 117494403
121819907 116960596 120936335 117217061 118630217 122322724
119595577 111693298 119232201 120030377 115334687 113179982
118808254 116353592 140987367 137095516 131724276 139742240
136501150 130428761 127585535 132483981 133430250 133756207
131786710 126365824 129812539 133850040 131742690 125142213
128572830 132234350 131945922 128417707 133355434 129972846
126331823 134050849 133991626 121129038 124637283 132830916
126875507 122322440 125776487 124340278 TLB shootdowns
AFTER
-----
TLB:
2121206 2615108 2983494 2911950 3055086 3092672
3204894 3346082 3286744 3307310 3357296 3315940
3428034 3112596 3143325 3185551 3186493 3322314
3330523 3339663 3156064 3272070 3296309 3198962
3332662 3315870 3234467 3353240 3281234 3300666
3345452 3173097 4009196 3932215 3898735 3726531
3717982 3671726 3728788 3724613 3799147 3691764
3620630 3684655 3666688 3393974 3448651 3487593
3446357 3618418 3671920 3712949 3575264 3715385
3641513 3630897 3691047 3630690 3504933 3662647
3629926 3443044 3832970 3548813 TLB shootdowns
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 5 ++
include/linux/fs.h | 12 +++-
include/linux/mm_types.h | 6 ++
include/linux/sched.h | 9 +++
kernel/sched/core.c | 1 +
mm/internal.h | 94 ++++++++++++++++++++++++-
mm/memory.c | 15 ++++
mm/pgtable-generic.c | 2 +
mm/rmap.c | 141 +++++++++++++++++++++++++++++++++++---
mm/truncate.c | 55 +++++++++++++--
mm/vmscan.c | 12 +++-
11 files changed, 333 insertions(+), 19 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e402aef79c93e..5bb6b05bd3549 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -565,6 +565,11 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
+ /*
+ * Don't leave stale tlb entries for this vma.
+ */
+ luf_flush(0);
+
if (tlb->fullmm)
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 820ff4752249e..78aaf769d32d1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -536,8 +536,18 @@ static inline int mapping_write_begin(struct file *file,
loff_t pos, unsigned len,
struct folio **foliop, void **fsdata)
{
- return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ int ret;
+
+ ret = mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
fsdata);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ if (!ret)
+ luf_flush(0);
+
+ return ret;
}
static inline int mapping_write_end(struct file *file,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4bfe8d072b0ea..cb9e6282b7ad1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1339,6 +1339,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+void luf_flush(unsigned short luf_key);
+#else
+static inline void luf_flush(unsigned short luf_key) {}
+#endif
+
struct vm_fault;
/**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d1a3c97491ff2..47a0a3ccb7b1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1408,6 +1408,15 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
struct tlbflush_unmap_batch tlb_ubc_ro;
+ struct tlbflush_unmap_batch tlb_ubc_luf;
+
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ /*
+ * whether all the mappings of a folio during unmap are read-only
+ * so that luf can work on the folio
+ */
+ bool can_luf;
+#endif
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac691..1f4c5da800365 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5275,6 +5275,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
+ luf_flush(0);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/internal.h b/mm/internal.h
index 9dbb65f919337..43e91f04d6d1c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1601,13 +1601,105 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-#else
+
+unsigned short fold_unmap_luf(void);
+
+/*
+ * Reset the indicator of whether all the mappings are read-only, at
+ * the beginning of every rmap traverse for unmap. luf can work only
+ * when all the mappings are read-only.
+ */
+static inline void can_luf_init(struct folio *f)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (want_init_on_free())
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (!should_skip_kasan_poison(folio_page(f, 0)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles zone device folio.
+ */
+ else if (unlikely(folio_is_zone_device(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles hugetlb folio.
+ */
+ else if (unlikely(folio_test_hugetlb(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles large folio.
+ */
+ else if (unlikely(folio_test_large(f)))
+ current->can_luf = false;
+ /*
+ * Can track write of anon folios through fault handler.
+ */
+ else if (folio_test_anon(f))
+ current->can_luf = true;
+ /*
+ * Can track write of file folios through page cache or truncation.
+ */
+ else if (folio_mapping(f))
+ current->can_luf = true;
+ /*
+ * For folios that are neither anon nor file, do not apply luf.
+ */
+ else
+ current->can_luf = false;
+}
+
+/*
+ * Mark the folio as not applicable to luf once a writable or
+ * dirty pte is found during the rmap traverse for unmap.
+ */
+static inline void can_luf_fail(void)
+{
+ current->can_luf = false;
+}
+
+/*
+ * Check if all the mappings are read-only.
+ */
+static inline bool can_luf_test(void)
+{
+ return current->can_luf;
+}
+
+static inline bool can_luf_vma(struct vm_area_struct *vma)
+{
+ /*
+ * A shared region requires a medium like a file to track all the
+ * associated mm_structs. luf makes use of struct address_space
+ * for that purpose.
+ */
+ if (vma->vm_flags & VM_SHARED)
+ return !!vma->vm_file;
+
+ /*
+ * Private region can be handled through its mm_struct.
+ */
+ return true;
+}
+#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
+static inline unsigned short fold_unmap_luf(void) { return 0; }
+
+static inline void can_luf_init(struct folio *f) {}
+static inline void can_luf_fail(void) {}
+static inline bool can_luf_test(void) { return false; }
+static inline bool can_luf_vma(struct vm_area_struct *vma) { return false; }
#endif
/* pagewalk.c */
diff --git a/mm/memory.c b/mm/memory.c
index b4d3d4893267c..c1d2d2b0112cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6181,6 +6181,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6206,6 +6207,14 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
lru_gen_enter_fault(vma);
+ /*
+ * Any potential case that makes the pte writable, even forcibly,
+ * should be considered.
+ */
+ if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
+ flags & FAULT_FLAG_WRITE)
+ flush = true;
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6237,6 +6246,12 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
out:
mm_account_fault(mm, regs, address, flags, ret);
+ /*
+ * Ensure to clean stale tlb entries for this vma.
+ */
+ if (flush)
+ luf_flush(0);
+
return ret;
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f90..d6678d6bac746 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -99,6 +99,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
pte = ptep_get_and_clear(mm, address, ptep);
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
+ else
+ luf_flush(0);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 3ed6234dd777e..c3df36cf7ac16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -646,7 +646,7 @@ static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
*/
-static unsigned long __maybe_unused new_luf_ugen(void)
+static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
@@ -723,7 +723,7 @@ static atomic_t luf_kgen = ATOMIC_INIT(1);
/*
* Don't return invalid luf_key, zero.
*/
-static unsigned short __maybe_unused new_luf_key(void)
+static unsigned short new_luf_key(void)
{
unsigned short luf_key = atomic_inc_return(&luf_kgen);
@@ -776,6 +776,7 @@ void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = ¤t->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -793,9 +794,72 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc_ro);
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_luf);
+
reset_batch(tlb_ubc_takeoff);
}
+/*
+ * Should be called just before try_to_unmap_flush() to optimize the tlb
+ * shootdown using arch_tlbbatch_done().
+ */
+unsigned short fold_unmap_luf(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
+ struct luf_batch *lb;
+ unsigned long new_ugen;
+ unsigned short new_key;
+ unsigned long flags;
+
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ /*
+ * fold_unmap_luf() is always followed by try_to_unmap_flush().
+ */
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc->arch)) {
+ tlb_ubc_luf->flush_required = false;
+ tlb_ubc_luf->writable = false;
+ }
+
+ /*
+ * Check again after shrinking.
+ */
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ new_ugen = new_luf_ugen();
+ new_key = new_luf_key();
+
+ /*
+ * Update the next entry of luf_batch table, that is the oldest
+ * entry among the candidate, hopefully tlb flushes have been
+ * done for all of the CPUs.
+ */
+ lb = &luf_batch[new_key];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc_luf, new_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+
+ reset_batch(tlb_ubc_luf);
+ return new_key;
+}
+
+void luf_flush(unsigned short luf_key)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct luf_batch *lb = &luf_batch[luf_key];
+ unsigned long flags;
+
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -806,8 +870,10 @@ void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
@@ -820,8 +886,9 @@ void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
- if (tlb_ubc->writable || tlb_ubc_ro->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable || tlb_ubc_luf->writable)
try_to_unmap_flush();
}
@@ -836,7 +903,8 @@ void try_to_unmap_flush_dirty(void)
(TLB_FLUSH_BATCH_PENDING_MASK / 2)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long uaddr)
+ unsigned long uaddr,
+ struct vm_area_struct *vma)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -845,7 +913,16 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!pte_accessible(mm, pteval))
return;
- if (pte_write(pteval))
+ if (can_luf_test()) {
+ /*
+ * luf cannot work with the folio once a writable or
+ * dirty mapping is found on it.
+ */
+ if (pte_write(pteval) || !can_luf_vma(vma))
+ can_luf_fail();
+ }
+
+ if (!can_luf_test())
tlb_ubc = ¤t->tlb_ubc;
else
tlb_ubc = ¤t->tlb_ubc_ro;
@@ -853,6 +930,21 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
+ if (can_luf_test()) {
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * Accumulate to the 0th entry right away so that
+ * luf_flush(0) can be used to properly perform the pending
+ * TLB flush once this unmapping is observed.
+ */
+ lb = &luf_batch[0];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
+ write_unlock_irqrestore(&lb->lock, flags);
+ }
+
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -907,6 +999,8 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
* This must be called under the PTL so that an access to tlb_flush_batched
* that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
* via the PTL.
+ *
+ * LUF(Lazy Unmap Flush) also relies on this for mprotect/munmap/etc.
*/
void flush_tlb_batched_pending(struct mm_struct *mm)
{
@@ -916,6 +1010,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
if (pending != flushed) {
arch_flush_tlb_batched_pending(mm);
+
/*
* If the new TLB flushing is pending during flushing, leave
* mm->tlb_flush_batched as is, to avoid losing flushing.
@@ -926,7 +1021,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
}
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long uaddr)
+ unsigned long uaddr,
+ struct vm_area_struct *vma)
{
}
@@ -1292,6 +1388,11 @@ int folio_mkclean(struct folio *folio)
rmap_walk(folio, &rwc);
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
return cleaned;
}
EXPORT_SYMBOL_GPL(folio_mkclean);
@@ -1961,7 +2062,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2132,6 +2233,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2164,11 +2267,21 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
+
+ can_luf_init(folio);
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
/*
@@ -2338,7 +2451,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2494,6 +2607,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2513,6 +2628,9 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = ¤t->tlb_ubc_luf;
/*
* Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
@@ -2537,10 +2655,17 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
if (!folio_test_ksm(folio) && folio_test_anon(folio))
rwc.invalid_vma = invalid_migration_vma;
+ can_luf_init(folio);
+
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/truncate.c b/mm/truncate.c
index e2e115adfbc58..2bf3806391c21 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -124,6 +124,11 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
if (aops->invalidate_folio)
aops->invalidate_folio(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -160,6 +165,11 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
truncate_cleanup_folio(folio);
filemap_remove_folio(folio);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return 0;
}
@@ -205,6 +215,12 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
if (folio_needs_release(folio))
folio_invalidate(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
if (!folio_test_large(folio))
return true;
if (split_folio(folio) == 0)
@@ -246,19 +262,28 @@ EXPORT_SYMBOL(generic_error_remove_folio);
*/
long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
/* The page may have been truncated before it was locked */
if (!mapping)
- return 0;
+ goto out;
if (folio_test_dirty(folio) || folio_test_writeback(folio))
- return 0;
+ goto out;
/* The refcount will be elevated if any page in the folio is mapped */
if (folio_ref_count(folio) >
folio_nr_pages(folio) + folio_has_private(folio) + 1)
- return 0;
+ goto out;
if (!filemap_release_folio(folio, 0))
- return 0;
+ goto out;
- return remove_mapping(mapping, folio);
+ ret = remove_mapping(mapping, folio);
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
@@ -298,7 +323,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
bool same_folio;
if (mapping_empty(mapping))
- return;
+ goto out;
/*
* 'start' and 'end' always covers the range of pages to be fully
@@ -386,6 +411,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
folio_batch_release(&fbatch);
}
+
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -501,6 +532,11 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
folio_batch_release(&fbatch);
cond_resched();
}
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return count;
}
@@ -605,7 +641,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
int did_range_unmap = 0;
if (mapping_empty(mapping))
- return 0;
+ goto out;
folio_batch_init(&fbatch);
index = start;
@@ -666,6 +702,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
if (dax_mapping(mapping)) {
unmap_mapping_pages(mapping, start, end - start + 1, false);
}
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ff1c53e769398..461e7643898e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -828,6 +828,8 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
*/
long remove_mapping(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
if (__remove_mapping(mapping, folio, false, NULL)) {
/*
* Unfreezing the refcount with 1 effectively
@@ -835,9 +837,15 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
* atomic operation.
*/
folio_ref_unfreeze(folio, 1);
- return folio_nr_pages(folio);
+ ret = folio_nr_pages(folio);
}
- return 0;
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (14 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
` (7 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
The luf mechanism performs tlb shootdown for mappings that have been
unmapped in a lazy manner. However, it doesn't have to send the
shootdown to cpus whose tlb has already been flushed by others since
the shootdown became necessary.
Since luf already introduced its own generation number used as a global
timestamp, luf_ugen, it's possible to selectively pick the cpus that
have already performed the required tlb flush.
This patch introduces APIs that use the generation number to identify
and remove those cpus, so that tlb shootdown can be performed with a
smaller cpumask, for all the CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
archs: x86, riscv, and arm64.
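As a rough illustration only, the following userspace sketch shows the
pruning idea: each cpu records the latest generation it has flushed up
to, and a pending shootdown drops the cpus whose recorded generation is
already new enough. The array standing in for a cpumask and the helper
names are simplified assumptions, not the actual arch code.
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static unsigned long ugen_done[NR_CPUS];        /* per-CPU flushed-up-to */

static bool ugen_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;       /* wrap-safe comparison */
}

/* Drop CPUs that already flushed past 'ugen'; return true if none left. */
static bool prune_cpumask(bool *cpumask, unsigned long ugen)
{
        bool empty = true;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (cpumask[cpu] && !ugen_before(ugen_done[cpu], ugen))
                        cpumask[cpu] = false;
                empty &= !cpumask[cpu];
        }
        return empty;
}

int main(void)
{
        bool cpumask[NR_CPUS] = { true, true, true, true };

        ugen_done[1] = 10;
        ugen_done[3] = 10;
        if (!prune_cpumask(cpumask, 5)) /* CPUs 1 and 3 get dropped */
                puts("shootdown still needed for CPUs 0 and 2");
        return 0;
}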
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 26 +++++++
arch/riscv/include/asm/tlbflush.h | 4 ++
arch/riscv/mm/tlbflush.c | 108 ++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 4 ++
arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++
include/linux/sched.h | 1 +
mm/internal.h | 4 ++
mm/page_alloc.c | 32 +++++++--
mm/rmap.c | 46 ++++++++++++-
9 files changed, 327 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index acac53a21e5d1..5547ab1ffb3c3 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -354,6 +354,32 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
+static inline void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
/* nothing to do */
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 1dc7d30273d59..ec5caeb3cf8ef 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -65,6 +65,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
unsigned long uaddr);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 36f996af6256c..93afb7a299003 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -202,3 +202,111 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
}
+
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * batch will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done ends up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's for optimization. Just skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race results in an unnecessary tlb flush
+ * because ugen_done ends up smaller than it should be.
+ * However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's for optimization. Just skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c27e61bd274a5..dbcbf0477ed2a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -294,6 +294,10 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 523e8bb6fba1f..be6068b60c32d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1270,6 +1270,114 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * The batch will not be updated; it is only checked.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race only results in an unnecessary tlb
+ * flush because ugen_done may end up smaller than it should
+ * be. However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's an optimization. Just skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy. The race only results in an unnecessary tlb
+ * flush because ugen_done may end up smaller than it should
+ * be. However, it's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's an optimization. Just skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 47a0a3ccb7b1a..31efc88ce911a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,7 @@ struct task_struct {
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
int luf_no_shootdown;
int luf_takeoff_started;
+ unsigned long luf_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/internal.h b/mm/internal.h
index 43e91f04d6d1c..a95c46355e93d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1259,6 +1259,7 @@ void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void reset_batch(struct tlbflush_unmap_batch *batch);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
@@ -1274,6 +1275,9 @@ static inline void try_to_unmap_flush_takeoff(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+}
static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
{
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index db1460c07b964..8e1ed80f304cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -668,9 +668,11 @@ bool luf_takeoff_start(void)
*/
void luf_takeoff_end(void)
{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
+ unsigned long cur_luf_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -697,10 +699,19 @@ void luf_takeoff_end(void)
if (no_shootdown)
goto out;
+ cur_luf_ugen = current->luf_ugen;
+
+ current->luf_ugen = 0;
+
+ if (cur_luf_ugen && arch_tlbbatch_diet(&tlb_ubc_takeoff->arch, cur_luf_ugen))
+ reset_batch(tlb_ubc_takeoff);
+
try_to_unmap_flush_takeoff();
out:
- if (outmost)
+ if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
+ VM_WARN_ON(current->luf_ugen);
+ }
}
/*
@@ -757,6 +768,7 @@ bool luf_takeoff_check_and_fold(struct page *page)
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned short luf_key = page_luf_key(page);
struct luf_batch *lb;
+ unsigned long lb_ugen;
unsigned long flags;
/*
@@ -770,13 +782,25 @@ bool luf_takeoff_check_and_fold(struct page *page)
if (!luf_key)
return true;
- if (current->luf_no_shootdown)
- return false;
-
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
+
fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index c3df36cf7ac16..fcd27200efa04 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -656,7 +656,7 @@ static unsigned long new_luf_ugen(void)
return ugen;
}
-static void reset_batch(struct tlbflush_unmap_batch *batch)
+void reset_batch(struct tlbflush_unmap_batch *batch)
{
arch_tlbbatch_clear(&batch->arch);
batch->flush_required = false;
@@ -743,8 +743,14 @@ static void __fold_luf_batch(struct luf_batch *dst_lb,
* more tlb shootdown might be needed to fulfill the newer
* request. Conservatively keep the newer one.
*/
- if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen)) {
+ /*
+ * Good chance to shrink the batch using the old ugen.
+ */
+ if (dst_lb->ugen && arch_tlbbatch_diet(&dst_lb->batch.arch, dst_lb->ugen))
+ reset_batch(&dst_lb->batch);
dst_lb->ugen = src_ugen;
+ }
fold_batch(&dst_lb->batch, src_batch, false);
}
@@ -772,17 +778,45 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static unsigned long tlb_flush_start(void)
+{
+ /*
+ * The memory barrier implied by the atomic operation prevents
+ * the read of luf_ugen from happening after the following
+ * tlb flush.
+ */
+ return new_luf_ugen();
+}
+
+static void tlb_flush_end(struct arch_tlbflush_unmap_batch *arch,
+ struct mm_struct *mm, unsigned long ugen)
+{
+ /*
+ * Prevent the following marking from being reordered before the
+ * actual tlb flush.
+ */
+ smp_mb();
+
+ if (arch)
+ arch_tlbbatch_mark_ugen(arch, ugen);
+ if (mm)
+ arch_mm_mark_ugen(mm, ugen);
+}
+
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+ unsigned long ugen;
if (!tlb_ubc_takeoff->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+ tlb_flush_end(&tlb_ubc_takeoff->arch, NULL, ugen);
/*
* Now that tlb shootdown of tlb_ubc_takeoff has been performed,
@@ -871,13 +905,17 @@ void try_to_unmap_flush(void)
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+ unsigned long ugen;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc->arch);
+ tlb_flush_end(&tlb_ubc->arch, NULL, ugen);
+
reset_batch(tlb_ubc);
}
@@ -1009,7 +1047,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
if (pending != flushed) {
+ unsigned long ugen;
+
+ ugen = tlb_flush_start();
arch_flush_tlb_batched_pending(mm);
+ tlb_flush_end(NULL, mm, ugen);
/*
* If the new TLB flushing is pending during flushing, leave
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (15 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 19/25] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
` (6 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/page_alloc.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e1ed80f304cd..811e7c4bd2d19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3306,6 +3306,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
{
struct page *page;
+ /*
+ * Give up taking a page from pcp if the luf check fails
+ * 3 times because tlb shootdown is not allowed.
+ */
+ int try_luf_pages = 3;
+
do {
if (list_empty(list)) {
int batch = nr_pcp_alloc(pcp, zone, order);
@@ -3320,11 +3326,21 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
return NULL;
}
- page = list_first_entry(list, struct page, pcp_list);
- if (!luf_takeoff_check_and_fold(page))
+ list_for_each_entry(page, list, pcp_list) {
+ if (luf_takeoff_check_and_fold(page)) {
+ list_del(&page->pcp_list);
+ pcp->count -= 1 << order;
+ break;
+ }
+ if (!--try_luf_pages)
+ return NULL;
+ }
+
+ /*
+ * If all the pages in the list fail the luf check...
+ */
+ if (list_entry_is_head(page, list, pcp_list))
return NULL;
- list_del(&page->pcp_list);
- pcp->count -= 1 << order;
} while (check_new_pages(page, order));
return page;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 19/25] mm: skip luf tlb flush for luf'd mm that already has been done
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (16 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 20/25] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
` (5 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
The fault handler performs the tlb flush pending from luf whenever a new
pte gains write permission, regardless of whether the required tlb flush
has already been performed.
By storing the luf generation number, luf_ugen, in struct mm_struct, we
can skip such unnecessary tlb flushes.
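As a rough illustration of the idea (not the kernel implementation; the
names mm_done_ugen and maybe_flush and all values below are made up for
this sketch), skipping the flush boils down to a wrap-safe generation
comparison against the newest generation already flushed:

#include <stdbool.h>
#include <stdio.h>

/* wrap-safe "a happened before b" on generation numbers */
static bool ugen_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* pretend per-mm record of the newest unmap generation already flushed */
static unsigned long mm_done_ugen = 105;

static void maybe_flush(unsigned long required_ugen)
{
	if (!ugen_before(mm_done_ugen, required_ugen)) {
		printf("ugen %lu already covered, skip flush\n", required_ugen);
		return;
	}
	printf("flushing up to ugen %lu\n", required_ugen);
	mm_done_ugen = required_ugen;
}

int main(void)
{
	maybe_flush(100);	/* skipped: generation 105 already covers it */
	maybe_flush(110);	/* performed and recorded */
	return 0;
}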
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 2 +-
include/linux/mm_types.h | 9 +++++
kernel/fork.c | 1 +
kernel/sched/core.c | 2 +-
mm/memory.c | 22 ++++++++++--
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 74 +++++++++++++++++++++++++++++++++++++--
7 files changed, 104 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 5bb6b05bd3549..f156e8cb3bd4a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -568,7 +568,7 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
/*
* Don't leave stale tlb entries for this vma.
*/
- luf_flush(0);
+ luf_flush_vma(vma);
if (tlb->fullmm)
return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cb9e6282b7ad1..2ac93d4f67c15 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -38,8 +38,10 @@ struct luf_batch {
unsigned long ugen;
rwlock_t lock;
};
+void luf_batch_init(struct luf_batch *lb);
#else
struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
#endif
/*
@@ -1059,6 +1061,9 @@ struct mm_struct {
* moving a PROT_NONE mapped page.
*/
atomic_t tlb_flush_pending;
+
+ /* luf batch for this mm */
+ struct luf_batch luf_batch;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
atomic_t tlb_flush_batched;
@@ -1341,8 +1346,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
void luf_flush(unsigned short luf_key);
+void luf_flush_mm(struct mm_struct *mm);
+void luf_flush_vma(struct vm_area_struct *vma);
#else
static inline void luf_flush(unsigned short luf_key) {}
+static inline void luf_flush_mm(struct mm_struct *mm) {}
+static inline void luf_flush_vma(struct vm_area_struct *vma) {}
#endif
struct vm_fault;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..ece87fece2113 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1280,6 +1280,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
spin_lock_init(&mm->arg_lock);
+ luf_batch_init(&mm->luf_batch);
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f4c5da800365..ec132abbbce6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5275,7 +5275,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
- luf_flush(0);
+ luf_flush_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/memory.c b/mm/memory.c
index c1d2d2b0112cd..52bd45fe00f55 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6181,6 +6181,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ struct address_space *mapping = NULL;
bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6212,9 +6213,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* should be considered.
*/
if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
- flags & FAULT_FLAG_WRITE)
+ flags & FAULT_FLAG_WRITE) {
flush = true;
+ /*
+ * Don't care about the !VM_SHARED cases because they won't
+ * update pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+ }
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6249,8 +6258,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
/*
* Ensure to clean stale tlb entries for this vma.
*/
- if (flush)
- luf_flush(0);
+ if (flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+ }
return ret;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d6678d6bac746..545d401db82c1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -100,7 +100,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
else
- luf_flush(0);
+ luf_flush_vma(vma);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd27200efa04..d68cfd28e0939 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -695,7 +695,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
-static void luf_batch_init(struct luf_batch *lb)
+void luf_batch_init(struct luf_batch *lb)
{
rwlock_init(&lb->lock);
reset_batch(&lb->batch);
@@ -778,6 +778,31 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static void fold_luf_batch_mm(struct luf_batch *dst,
+ struct mm_struct *mm)
+{
+ unsigned long flags;
+ bool need_fold = false;
+
+ read_lock_irqsave(&dst->lock, flags);
+ if (arch_tlbbatch_need_fold(&dst->batch.arch, mm))
+ need_fold = true;
+ read_unlock(&dst->lock);
+
+ write_lock(&dst->lock);
+ if (unlikely(need_fold))
+ arch_tlbbatch_add_pending(&dst->batch.arch, mm, 0);
+
+ /*
+ * dst->ugen represents a sort of request for tlb shootdown. The
+ * newer it is, the more tlb shootdowns might be needed to
+ * fulfill the request. Keep the newest one so as not to miss a
+ * necessary tlb shootdown.
+ */
+ dst->ugen = new_luf_ugen();
+ write_unlock_irqrestore(&dst->lock, flags);
+}
+
static unsigned long tlb_flush_start(void)
{
/*
@@ -894,6 +919,49 @@ void luf_flush(unsigned short luf_key)
}
EXPORT_SYMBOL(luf_flush);
+void luf_flush_vma(struct vm_area_struct *vma)
+{
+ struct mm_struct *mm;
+ struct address_space *mapping = NULL;
+
+ if (!vma)
+ return;
+
+ mm = vma->vm_mm;
+ /*
+ * Don't care about the !VM_SHARED cases because they won't
+ * update pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+}
+
+void luf_flush_mm(struct mm_struct *mm)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mm)
+ return;
+
+ lb = &mm->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -962,8 +1030,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!can_luf_test())
tlb_ubc = &current->tlb_ubc;
- else
+ else {
tlb_ubc = &current->tlb_ubc_ro;
+ fold_luf_batch_mm(&mm->luf_batch, mm);
+ }
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 20/25] mm, fs: skip tlb flushes for luf'd filemap that already has been done
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (17 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 19/25] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 21/25] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
` (4 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
For a luf'd filemap, tlb shootdown is performed whenever the page cache
is updated, regardless of whether the required tlb flushes have already
been done.
By storing luf metadata in struct address_space and keeping it up to
date, we can skip such unnecessary tlb flushes.
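For illustration only, here is a minimal userspace model of the
per-mapping bookkeeping (luf_batch_model, cpu_done_ugen and
flush_mapping are invented names for this sketch, not kernel APIs): a
mapping remembers which CPUs may hold stale entries plus the newest
unmap generation, and a page-cache update only IPIs CPUs that have not
flushed past that generation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct luf_batch_model {
	uint64_t cpumask;	/* CPUs that may hold stale entries */
	unsigned long ugen;	/* newest unmap generation recorded */
};

static bool ugen_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* per-CPU record of the newest generation already flushed */
static unsigned long cpu_done_ugen[2] = { 7, 3 };

static void flush_mapping(struct luf_batch_model *lb)
{
	uint64_t pending = 0;

	for (int cpu = 0; cpu < 2; cpu++) {
		if (!(lb->cpumask & (1ULL << cpu)))
			continue;
		/* skip CPUs that already flushed past lb->ugen */
		if (ugen_before(cpu_done_ugen[cpu], lb->ugen))
			pending |= 1ULL << cpu;
	}

	if (!pending) {
		printf("nothing pending, skip shootdown\n");
		return;
	}
	printf("send IPIs to cpumask 0x%llx\n", (unsigned long long)pending);
}

int main(void)
{
	struct luf_batch_model lb = { .cpumask = 0x3, .ugen = 5 };

	flush_mapping(&lb);	/* only CPU1 (done=3, before 5) gets an IPI */
	return 0;
}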
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
fs/inode.c | 1 +
include/linux/fs.h | 4 ++-
include/linux/mm_types.h | 2 ++
mm/memory.c | 4 +--
mm/rmap.c | 59 +++++++++++++++++++++++++---------------
mm/truncate.c | 14 +++++-----
mm/vmscan.c | 2 +-
7 files changed, 53 insertions(+), 33 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 5587aabdaa5ee..752fb2df6f3b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -475,6 +475,7 @@ static void __address_space_init_once(struct address_space *mapping)
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->i_private_list);
spin_lock_init(&mapping->i_private_lock);
+ luf_batch_init(&mapping->luf_batch);
mapping->i_mmap = RB_ROOT_CACHED;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 78aaf769d32d1..a2f014b31028f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -498,6 +498,7 @@ extern const struct address_space_operations empty_aops;
* @i_private_lock: For use by the owner of the address_space.
* @i_private_list: For use by the owner of the address_space.
* @i_private_data: For use by the owner of the address_space.
+ * @luf_batch: Data to track the tlb flushes required by luf.
*/
struct address_space {
struct inode *host;
@@ -519,6 +520,7 @@ struct address_space {
struct list_head i_private_list;
struct rw_semaphore i_mmap_rwsem;
void * i_private_data;
+ struct luf_batch luf_batch;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
@@ -545,7 +547,7 @@ static inline int mapping_write_begin(struct file *file,
* Ensure to clean stale tlb entries for this mapping.
*/
if (!ret)
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2ac93d4f67c15..96015fc68e4f5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1348,10 +1348,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
void luf_flush(unsigned short luf_key);
void luf_flush_mm(struct mm_struct *mm);
void luf_flush_vma(struct vm_area_struct *vma);
+void luf_flush_mapping(struct address_space *mapping);
#else
static inline void luf_flush(unsigned short luf_key) {}
static inline void luf_flush_mm(struct mm_struct *mm) {}
static inline void luf_flush_vma(struct vm_area_struct *vma) {}
+static inline void luf_flush_mapping(struct address_space *mapping) {}
#endif
struct vm_fault;
diff --git a/mm/memory.c b/mm/memory.c
index 52bd45fe00f55..6cdc1df0424f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6261,10 +6261,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flush) {
/*
* If it has a VM_SHARED mapping, all the mms involved
- * should be luf_flush'ed.
+ * in the struct address_space should be luf_flush'ed.
*/
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index d68cfd28e0939..58dfc9889b1ee 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -691,7 +691,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
/*
- * Use 0th entry as accumulated batch.
+ * XXX: Reserve the 0th entry for later use.
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
@@ -936,7 +936,7 @@ void luf_flush_vma(struct vm_area_struct *vma)
mapping = vma->vm_file->f_mapping;
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
@@ -962,6 +962,29 @@ void luf_flush_mm(struct mm_struct *mm)
try_to_unmap_flush();
}
+void luf_flush_mapping(struct address_space *mapping)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mapping)
+ return;
+
+ lb = &mapping->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush_mapping);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -1010,7 +1033,8 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -1032,27 +1056,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
tlb_ubc = &current->tlb_ubc;
else {
tlb_ubc = &current->tlb_ubc_ro;
+
fold_luf_batch_mm(&mm->luf_batch, mm);
+ if (mapping)
+ fold_luf_batch_mm(&mapping->luf_batch, mm);
}
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
tlb_ubc->flush_required = true;
- if (can_luf_test()) {
- struct luf_batch *lb;
- unsigned long flags;
-
- /*
- * Accumulate to the 0th entry right away so that
- * luf_flush(0) can be uesed to properly perform pending
- * TLB flush once this unmapping is observed.
- */
- lb = &luf_batch[0];
- write_lock_irqsave(&lb->lock, flags);
- __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
- write_unlock_irqrestore(&lb->lock, flags);
- }
-
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -1134,7 +1146,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long uaddr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
}
@@ -1503,7 +1516,7 @@ int folio_mkclean(struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return cleaned;
}
@@ -2037,6 +2050,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2174,7 +2188,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma, mapping);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2414,6 +2428,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2563,7 +2578,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, vma, mapping);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 2bf3806391c21..b2934c4edebf1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -128,7 +128,7 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -169,7 +169,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return 0;
}
@@ -219,7 +219,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
if (!folio_test_large(folio))
return true;
@@ -281,7 +281,7 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
@@ -416,7 +416,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -536,7 +536,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return count;
}
@@ -706,7 +706,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 461e7643898e7..a31a7cf87315f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -843,7 +843,7 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on v6.14-rc4 21/25] mm: perform luf tlb shootdown per zone in batched manner
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (18 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 20/25] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
` (3 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Each luf page in buddy carries its pending tlb shootdown information and
performs the corresponding tlb shootdown on exit from buddy. However,
every exit from buddy causes small but frequent IPIs. Even though the
total number of IPIs is reduced, unnecessary waits on conflicting CPUs
in the IPI handler have been observed via perf profiling.
Thus, make luf perform its tlb shootdown per zone in a batched manner
when pages exit from buddy, so as to avoid frequent IPIs.
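A simplified userspace model of the batching (zone_model, page_takeoff
and zone_flush are made-up names for this sketch, not the actual code):
pages leaving buddy only contribute their pending CPUs to a per-zone
batch, and a single shootdown per zone generation covers all of them:

#include <stdint.h>
#include <stdio.h>

struct zone_model {
	uint64_t pending_cpumask;	/* accumulated per-zone batch */
	unsigned long zone_ugen;	/* advanced when a flush is taken over */
	unsigned long zone_ugen_done;	/* last generation fully flushed */
};

/* a page leaving buddy only contributes its CPUs to the zone batch */
static void page_takeoff(struct zone_model *z, uint64_t page_cpumask)
{
	z->pending_cpumask |= page_cpumask;
}

/* one allocator exit path flushes the whole accumulated batch */
static void zone_flush(struct zone_model *z)
{
	if (!z->pending_cpumask)
		return;
	printf("one shootdown for cpumask 0x%llx covering zone_ugen %lu\n",
	       (unsigned long long)z->pending_cpumask, z->zone_ugen);
	z->zone_ugen_done = z->zone_ugen;
	z->zone_ugen++;
	z->pending_cpumask = 0;
}

int main(void)
{
	struct zone_model z = { .zone_ugen = 2, .zone_ugen_done = 1 };

	page_takeoff(&z, 0x1);
	page_takeoff(&z, 0x6);
	page_takeoff(&z, 0x4);
	zone_flush(&z);		/* a single IPI burst instead of three */
	return 0;
}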
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 44 ++++-
include/linux/mm_types.h | 19 +-
include/linux/mmzone.h | 9 +
include/linux/sched.h | 2 +
mm/compaction.c | 10 +-
mm/internal.h | 13 +-
mm/mm_init.c | 5 +
mm/page_alloc.c | 363 +++++++++++++++++++++++++++++++--------
mm/page_reporting.c | 9 +-
mm/rmap.c | 6 +-
10 files changed, 383 insertions(+), 97 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8c3481402d8cb..e8e6562abc77d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4155,12 +4155,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
-#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
/*
* luf_ugen will start with 2 so that 1 can be regarded as a passed one.
*/
#define LUF_UGEN_INIT 2
+/*
+ * zone_ugen will start with 2 so that 1 can be regarded as done.
+ */
+#define ZONE_UGEN_INIT 2
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
static inline bool ugen_before(unsigned long a, unsigned long b)
{
/*
@@ -4171,7 +4175,11 @@ static inline bool ugen_before(unsigned long a, unsigned long b)
static inline unsigned long next_ugen(unsigned long ugen)
{
- if (ugen + 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen + 1))
return ugen + 1;
/*
* Avoid invalid ugen, zero.
@@ -4181,7 +4189,11 @@ static inline unsigned long next_ugen(unsigned long ugen)
static inline unsigned long prev_ugen(unsigned long ugen)
{
- if (ugen - 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen - 1))
return ugen - 1;
/*
* Avoid invalid ugen, zero.
@@ -4189,4 +4201,30 @@ static inline unsigned long prev_ugen(unsigned long ugen)
return ugen - 2;
}
#endif
+
+/*
+ * Return the biggest candidate ugen that is still not after the real zone_ugen.
+ */
+static inline unsigned long page_zone_ugen(struct zone *zone, struct page *page)
+{
+ unsigned long zone_ugen = zone->zone_ugen;
+ unsigned short short_zone_ugen = page->zone_ugen;
+ unsigned long cand1, cand2;
+
+ if (!short_zone_ugen)
+ return 0;
+
+ cand1 = (zone_ugen & ~(unsigned long)USHRT_MAX) | short_zone_ugen;
+ cand2 = cand1 - USHRT_MAX - 1;
+
+ if (!ugen_before(zone_ugen, cand1))
+ return cand1;
+
+ return cand2;
+}
+
+static inline void set_page_zone_ugen(struct page *page, unsigned short zone_ugen)
+{
+ page->zone_ugen = zone_ugen;
+}
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96015fc68e4f5..c5f44b5c9758f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -132,11 +132,20 @@ struct page {
*/
unsigned short order;
- /*
- * For tracking need of tlb flush,
- * by luf(lazy unmap flush).
- */
- unsigned short luf_key;
+ union {
+ /*
+ * For tracking need of
+ * tlb flush, by
+ * luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+
+ /*
+ * Casted zone_ugen with
+ * unsigned short.
+ */
+ unsigned short zone_ugen;
+ };
};
};
};
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e2c8d7147e361..df5bacd48a2a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -117,6 +117,7 @@ extern int page_group_by_mobility_disabled;
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
struct list_head pend_list[MIGRATE_TYPES];
+ unsigned long pend_zone_ugen[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -1017,6 +1018,14 @@ struct zone {
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
/* Count pages that need tlb shootdown on allocation */
atomic_long_t nr_luf_pages;
+ /* Generation number up to which tlb shootdown has been done */
+ unsigned long zone_ugen_done;
+ /* Generation number to control zone batched tlb shootdown */
+ unsigned long zone_ugen;
+ /* Approximate latest luf_ugen that has ever entered the zone */
+ unsigned long luf_ugen;
+ /* Accumulated tlb batch for this zone */
+ struct tlbflush_unmap_batch zone_batch;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 31efc88ce911a..96375274d0335 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1404,6 +1404,8 @@ struct task_struct {
int luf_no_shootdown;
int luf_takeoff_started;
unsigned long luf_ugen;
+ unsigned long zone_ugen;
+ unsigned long wait_zone_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/compaction.c b/mm/compaction.c
index aa594a85d8aee..b7a7a6feb9eac 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
- if (!luf_takeoff_check(page))
+ if (!luf_takeoff_check(cc->zone, page))
goto isolate_fail;
/* Found a free page, will break it into order-0 pages */
@@ -691,7 +691,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/*
* Be careful to not go outside of the pageblock.
@@ -1613,7 +1613,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
- if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
+ if (unlikely(consider_pend && !luf_takeoff_check(cc->zone, freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1681,7 +1681,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
@@ -2418,7 +2418,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
*/
luf_takeoff_start();
ret = __compact_finished(cc);
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
diff --git a/mm/internal.h b/mm/internal.h
index a95c46355e93d..6d7b3b389810e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1592,10 +1592,10 @@ static inline void accept_page(struct page *page)
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
bool luf_takeoff_start(void);
-void luf_takeoff_end(void);
+void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
-bool luf_takeoff_check(struct page *page);
-bool luf_takeoff_check_and_fold(struct page *page);
+bool luf_takeoff_check(struct zone *zone, struct page *page);
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page);
static inline bool non_luf_pages_ok(struct zone *zone)
{
@@ -1605,7 +1605,6 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-
unsigned short fold_unmap_luf(void);
/*
@@ -1693,10 +1692,10 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
-static inline void luf_takeoff_end(void) {}
+static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
-static inline bool luf_takeoff_check(struct page *page) { return true; }
-static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
static inline unsigned short fold_unmap_luf(void) { return 0; }
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 41c38fbb58a30..69643c3564a47 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1400,6 +1400,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
+ zone->free_area[order].pend_zone_ugen[t] = ZONE_UGEN_INIT;
zone->free_area[order].nr_free = 0;
}
@@ -1407,6 +1408,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
atomic_long_set(&zone->nr_luf_pages, 0);
+ zone->zone_ugen_done = ZONE_UGEN_INIT - 1;
+ zone->zone_ugen = ZONE_UGEN_INIT;
+ zone->luf_ugen = LUF_UGEN_INIT - 1;
+ reset_batch(&zone->zone_batch);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 811e7c4bd2d19..917a257ea5706 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -663,16 +663,29 @@ bool luf_takeoff_start(void)
return !no_shootdown;
}
+static void wait_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ while (ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ cond_resched();
+}
+
+static void set_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ WRITE_ONCE(zone->zone_ugen_done, zone_ugen);
+}
+
/*
* Should be called within the same context of luf_takeoff_start().
*/
-void luf_takeoff_end(void)
+void luf_takeoff_end(struct zone *zone)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
unsigned long cur_luf_ugen;
+ unsigned long cur_zone_ugen;
+ unsigned long cur_wait_zone_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -700,6 +713,8 @@ void luf_takeoff_end(void)
goto out;
cur_luf_ugen = current->luf_ugen;
+ cur_zone_ugen = current->zone_ugen;
+ cur_wait_zone_ugen = current->wait_zone_ugen;
current->luf_ugen = 0;
@@ -707,10 +722,38 @@ void luf_takeoff_end(void)
reset_batch(tlb_ubc_takeoff);
try_to_unmap_flush_takeoff();
+
+ if (cur_wait_zone_ugen || cur_zone_ugen) {
+ /*
+ * pcp(zone == NULL) doesn't work with zone batch.
+ */
+ if (zone) {
+ current->zone_ugen = 0;
+ current->wait_zone_ugen = 0;
+
+ /*
+ * Guarantee that the tlb shootdown required for the
+ * zone_ugen has been completed by the time
+ * 'zone_ugen_done' is observed.
+ */
+ smp_mb();
+
+ /*
+ * zone->zone_ugen_done should be updated
+ * sequentially.
+ */
+ if (cur_wait_zone_ugen)
+ wait_zone_ugen_done(zone, cur_wait_zone_ugen);
+ if (cur_zone_ugen)
+ set_zone_ugen_done(zone, cur_zone_ugen);
+ }
+ }
out:
if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
VM_WARN_ON(current->luf_ugen);
+ VM_WARN_ON(current->zone_ugen);
+ VM_WARN_ON(current->wait_zone_ugen);
}
}
@@ -741,9 +784,9 @@ bool luf_takeoff_no_shootdown(void)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check(struct page *page)
+bool luf_takeoff_check(struct zone *zone, struct page *page)
{
- unsigned short luf_key = page_luf_key(page);
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -753,7 +796,29 @@ bool luf_takeoff_check(struct page *page)
return false;
}
- if (!luf_key)
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ return true;
+ }
+
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone-zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
return true;
return !current->luf_no_shootdown;
@@ -763,13 +828,11 @@ bool luf_takeoff_check(struct page *page)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check_and_fold(struct page *page)
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
- unsigned short luf_key = page_luf_key(page);
- struct luf_batch *lb;
- unsigned long lb_ugen;
unsigned long flags;
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -779,28 +842,94 @@ bool luf_takeoff_check_and_fold(struct page *page)
return false;
}
- if (!luf_key)
- return true;
+ /*
+ * pcp case
+ */
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
- lb = &luf_batch[luf_key];
- read_lock_irqsave(&lb->lock, flags);
- lb_ugen = lb->ugen;
+ if (!luf_key)
+ return true;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
- if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
- if (current->luf_no_shootdown) {
- read_unlock_irqrestore(&lb->lock, flags);
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone-zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ return true;
+
+ if (current->luf_no_shootdown)
return false;
- }
- fold_batch(tlb_ubc_takeoff, &lb->batch, false);
- read_unlock_irqrestore(&lb->lock, flags);
+ /*
+ * zone batched flush has been already set.
+ */
+ if (current->zone_ugen)
+ return true;
+
+ /*
+ * Others are already performing tlb shootdown for us. All we
+ * need is to wait for those to complete.
+ */
+ if (zone_ugen != zone->zone_ugen) {
+ if (!current->wait_zone_ugen ||
+ ugen_before(current->wait_zone_ugen, zone_ugen))
+ current->wait_zone_ugen = zone_ugen;
+ /*
+ * It's the first time that zone->zone_ugen has been set to
+ * current->zone_ugen. current->luf_ugen also gets set.
+ */
+ } else {
+ current->wait_zone_ugen = prev_ugen(zone->zone_ugen);
+ current->zone_ugen = zone->zone_ugen;
+ current->luf_ugen = zone->luf_ugen;
+
+ /*
+ * Now that tlb shootdown for the zone_ugen will be
+ * performed at luf_takeoff_end(), advance it so that
+ * the next zone->lock holder can efficiently avoid
+ * unnecessary tlb shootdown.
+ */
+ zone->zone_ugen = next_ugen(zone->zone_ugen);
- if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
- current->luf_ugen = lb_ugen;
+ /*
+ * All the luf pages will eventually become non-luf
+ * pages by the tlb flush at luf_takeoff_end(), and
+ * flush_pend_list_if_done() will empty pend_list.
+ */
+ atomic_long_set(&zone->nr_luf_pages, 0);
+ fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
+ }
return true;
}
#endif
@@ -822,6 +951,42 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
zone->nr_free_highatomic + nr_pages);
}
+static void flush_pend_list_if_done(struct zone *zone,
+ struct free_area *area, int migratetype)
+{
+ unsigned long zone_ugen_done = READ_ONCE(zone->zone_ugen_done);
+
+ /*
+ * The tlb shootdown required for the zone_ugen has already been
+ * done. Thus, let's move pages in pend_list to free_list to
+ * secure more non-luf pages.
+ */
+ if (!ugen_before(zone_ugen_done, area->pend_zone_ugen[migratetype]))
+ list_splice_init(&area->pend_list[migratetype],
+ &area->free_list[migratetype]);
+}
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+/*
+ * Should be called with zone->lock held and irq disabled.
+ */
+static void update_zone_batch(struct zone *zone, unsigned short luf_key)
+{
+ unsigned long lb_ugen;
+ struct luf_batch *lb = &luf_batch[luf_key];
+
+ read_lock(&lb->lock);
+ fold_batch(&zone->zone_batch, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock(&lb->lock);
+
+ if (ugen_before(zone->luf_ugen, lb_ugen))
+ zone->luf_ugen = lb_ugen;
+}
+#else
+static void update_zone_batch(struct zone *zone, unsigned short luf_key) {}
+#endif
+
/* Used for pages not on another list */
static inline void __add_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype,
@@ -830,6 +995,12 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
struct free_area *area = &zone->free_area[order];
struct list_head *list;
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
@@ -839,8 +1010,9 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page)) {
+ if (page_zone_ugen(zone, page)) {
list = &area->pend_list[migratetype];
+ area->pend_zone_ugen[migratetype] = zone->zone_ugen;
atomic_long_add(1 << order, &zone->nr_luf_pages);
} else
list = &area->free_list[migratetype];
@@ -862,6 +1034,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
+ unsigned long zone_ugen = page_zone_ugen(zone, page);
/* Free page moving can fail, so it happens before the type update */
VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
@@ -878,9 +1051,12 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page))
+ if (zone_ugen) {
list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
- else
+ if (!area->pend_zone_ugen[new_mt] ||
+ ugen_before(area->pend_zone_ugen[new_mt], zone_ugen))
+ area->pend_zone_ugen[new_mt] = zone_ugen;
+ } else
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
@@ -898,7 +1074,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
- if (page_luf_key(page))
+ if (page_zone_ugen(zone, page))
atomic_long_sub(1 << order, &zone->nr_luf_pages);
list_del(&page->buddy_list);
@@ -936,29 +1112,39 @@ static inline struct page *get_page_from_free_area(struct zone *zone,
*/
pend_first = !non_luf_pages_ok(zone);
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
if (pend_first) {
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
} else {
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
}
return NULL;
@@ -1023,6 +1209,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_pfn;
struct page *buddy;
bool to_tail;
+ unsigned long zone_ugen;
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1034,20 +1221,25 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
/*
- * Use the page's luf_key unchanged if luf_key == 0. Worth
- * noting that page_luf_key() will be 0 in most cases since it's
- * initialized at free_pages_prepare().
+ * Use the page's zone_ugen unchanged if luf_key == 0. Worth
+ * noting that page_zone_ugen() will be 0 in most cases since
+ * it's initialized at free_pages_prepare().
+ *
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
*/
- if (luf_key)
- set_page_luf_key(page, luf_key);
- else
- luf_key = page_luf_key(page);
+ if (luf_key) {
+ zone_ugen = zone->zone_ugen;
+ set_page_zone_ugen(page, (unsigned short)zone_ugen);
+ update_zone_batch(zone, luf_key);
+ } else
+ zone_ugen = page_zone_ugen(zone, page);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
- unsigned short buddy_luf_key;
+ unsigned long buddy_zone_ugen;
- if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
+ if (!zone_ugen && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -1080,17 +1272,15 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ buddy_zone_ugen = page_zone_ugen(zone, buddy);
+
/*
- * !buddy_luf_key && !luf_key : do nothing
- * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
- * !buddy_luf_key && luf_key : do nothing
- * buddy_luf_key && luf_key : merge two into luf_key
+ * if (!zone_ugen && !buddy_zone_ugen) : nothing to do
+ * if ( zone_ugen && !buddy_zone_ugen) : nothing to do
*/
- buddy_luf_key = page_luf_key(buddy);
- if (buddy_luf_key && !luf_key)
- luf_key = buddy_luf_key;
- else if (buddy_luf_key && luf_key)
- fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+ if ((!zone_ugen && buddy_zone_ugen) ||
+ ( zone_ugen && buddy_zone_ugen && ugen_before(zone_ugen, buddy_zone_ugen)))
+ zone_ugen = buddy_zone_ugen;
if (unlikely(buddy_mt != migratetype)) {
/*
@@ -1103,7 +1293,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
- set_page_luf_key(page, luf_key);
+ set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
}
@@ -1446,6 +1636,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
do {
unsigned long pfn;
int mt;
+ unsigned short luf_key;
page = list_last_entry(list, struct page, pcp_list);
pfn = page_to_pfn(page);
@@ -1456,7 +1647,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+ /*
+ * The page private stores luf_key while in pcp but
+ * zone_ugen while in buddy. Thus, the private
+ * needs to be cleared and the luf_key needs to
+ * be passed to buddy.
+ */
+ luf_key = page_luf_key(page);
+ set_page_private(page, 0);
+
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, luf_key);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
@@ -1501,7 +1701,15 @@ static void free_one_page(struct zone *zone, struct page *page,
* valid luf_key can be passed only if order == 0.
*/
VM_WARN_ON(luf_key && order);
- set_page_luf_key(page, luf_key);
+
+ /*
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
+ */
+ if (luf_key) {
+ set_page_zone_ugen(page, (unsigned short)zone->zone_ugen);
+ update_zone_batch(zone, luf_key);
+ }
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1655,7 +1863,7 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- if (page_luf_key(&page[size]))
+ if (page_zone_ugen(zone, &page[size]))
tail = true;
__add_to_free_list(&page[size], zone, high, migratetype, tail);
@@ -1673,7 +1881,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
@@ -2202,7 +2410,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
@@ -2441,12 +2649,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
}
return false;
@@ -2611,12 +2819,15 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
+
+ /*
+ * Reset all the luf fields. tlb shootdown will be
+ * performed at luf_takeoff_end() below if needed.
+ */
+ set_page_private(page, 0);
}
spin_unlock_irqrestore(&zone->lock, flags);
- /*
- * Check and flush before using the pages taken off.
- */
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return i;
}
@@ -3130,7 +3341,7 @@ int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
}
del_page_from_free_list(page, zone, order, mt);
- if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
/*
@@ -3229,7 +3440,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return NULL;
}
}
@@ -3237,7 +3448,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3327,7 +3538,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
list_for_each_entry(page, list, pcp_list) {
- if (luf_takeoff_check_and_fold(page)) {
+ if (luf_takeoff_check_and_fold(NULL, page)) {
list_del(&page->pcp_list);
pcp->count -= 1 << order;
break;
@@ -3362,7 +3573,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
return NULL;
}
@@ -3379,7 +3590,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -3418,6 +3629,7 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype);
out:
+
/* Separate test+clear to avoid unnecessary atomics */
if ((alloc_flags & ALLOC_KSWAPD) &&
unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
@@ -5017,7 +5229,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -5027,7 +5239,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -7111,7 +7323,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
pfn += (1 << order);
}
@@ -7119,7 +7331,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return end_pfn - start_pfn - already_offline;
}
@@ -7181,7 +7393,7 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- if (page_luf_key(current_buddy))
+ if (page_zone_ugen(zone, current_buddy))
tail = true;
add_to_free_list(current_buddy, zone, high, migratetype, tail);
@@ -7213,7 +7425,7 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page_head)))
VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
@@ -7229,7 +7441,7 @@ bool take_page_off_buddy(struct page *page)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
@@ -7248,6 +7460,13 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
+
+ /*
+ * Reset all the luf fields. tlb shootdown has already
+ * been performed by take_page_off_buddy().
+ */
+ set_page_private(page, 0);
+
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e152b22fbba8a..b23d3ed34ec07 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -118,7 +118,8 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
/*
* Ensure private is zero before putting into the
- * allocator.
+ * allocator. tlb shootdown has already been performed
+ * at isolation.
*/
set_page_private(page, 0);
@@ -194,7 +195,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
- if (unlikely(consider_pend && !luf_takeoff_check(page))) {
+ if (unlikely(consider_pend && !luf_takeoff_check(zone, page))) {
VM_WARN_ON(1);
continue;
}
@@ -238,7 +239,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -283,7 +284,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return err;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 58dfc9889b1ee..b6613b48669ac 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -650,7 +650,11 @@ static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
- if (!ugen)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if (!(unsigned short)ugen)
ugen = atomic_long_inc_return(&luf_ugen);
return ugen;
--
2.17.1
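The new_luf_ugen() hunk above skips any generation value whose low 16 bits are zero, because '(unsigned short)ugen' is later used as a luf_key and 0 is reserved to mean "no key". Below is a minimal standalone sketch of that rule, not kernel code; the plain counter and the model_ names are illustrative stand-ins for the atomic luf_ugen counter.

#include <stdio.h>

static unsigned long model_luf_ugen;    /* stands in for the atomic counter */

static unsigned long model_new_luf_ugen(void)
{
        unsigned long ugen = ++model_luf_ugen;

        /* skip any value that would yield luf_key == 0 */
        if (!(unsigned short)ugen)
                ugen = ++model_luf_ugen;
        return ugen;
}

int main(void)
{
        /* start just below a 16-bit boundary to show the skip */
        model_luf_ugen = 0xfffe;
        for (int i = 0; i < 4; i++) {
                unsigned long ugen = model_new_luf_ugen();
                printf("ugen=%#lx luf_key=%#x\n", ugen, (unsigned short)ugen);
        }
        return 0;
}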
* [RFC PATCH v12 based on v6.14-rc4 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok()
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (19 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 21/25] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 23/25] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
` (2 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Do not perform tlb shootdown if the context has preemption disabled and
there are already enough non-luf pages, so as not to hurt preemptibility.
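As a rough illustration only (not kernel code), the decision made in
no_shootdown_context() can be modeled as below. The boolean inputs stand
in for preemptible(), in_task(), irqs_disabled() and non_luf_pages_ok(zone)
used in the diff that follows.

#include <stdbool.h>
#include <stdio.h>

static bool model_no_shootdown(bool have_zone, bool non_luf_pages_ok,
                               bool preemptible, bool in_task,
                               bool irqs_disabled)
{
        /* enough non-luf pages: only shoot down from preemptible task context */
        if (have_zone && non_luf_pages_ok)
                return !(preemptible && in_task);

        /* under pressure or no zone hint: allow unless irqs off or not in task */
        return !(!irqs_disabled && in_task);
}

int main(void)
{
        /* preemption disabled but plenty of non-luf pages: avoid the shootdown */
        printf("no_shootdown=%d\n", model_no_shootdown(true, true, false, true, false));
        /* preemption disabled under memory pressure: shootdown is still allowed */
        printf("no_shootdown=%d\n", model_no_shootdown(true, false, false, true, false));
        return 0;
}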
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 6 +++---
mm/internal.h | 5 +++--
mm/page_alloc.c | 27 +++++++++++++++------------
mm/page_isolation.c | 2 +-
mm/page_reporting.c | 4 ++--
5 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index b7a7a6feb9eac..aab400ec6a734 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -606,7 +606,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -1603,7 +1603,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(cc->zone);
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
retry:
@@ -2416,7 +2416,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
* luf_takeoff_{start,end}() is required to identify whether
* this compaction context is tlb shootdownable for luf'd pages.
*/
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
ret = __compact_finished(cc);
luf_takeoff_end(cc->zone);
diff --git a/mm/internal.h b/mm/internal.h
index 6d7b3b389810e..b5f1928732498 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1591,7 +1591,7 @@ static inline void accept_page(struct page *page)
#endif /* CONFIG_UNACCEPTED_MEMORY */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
-bool luf_takeoff_start(void);
+bool luf_takeoff_start(struct zone *zone);
void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct zone *zone, struct page *page);
@@ -1605,6 +1605,7 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
+
unsigned short fold_unmap_luf(void);
/*
@@ -1691,7 +1692,7 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
return true;
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
-static inline bool luf_takeoff_start(void) { return false; }
+static inline bool luf_takeoff_start(struct zone *zone) { return false; }
static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 917a257ea5706..2a2103df2d88e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -623,22 +623,25 @@ compaction_capture(struct capture_control *capc, struct page *page,
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
-static bool no_shootdown_context(void)
+static bool no_shootdown_context(struct zone *zone)
{
/*
- * If it performs with irq disabled, that might cause a deadlock.
- * Avoid tlb shootdown in this case.
+ * Tries to avoid tlb shootdown if !preemptible(). However, it
+ * should be allowed under heavy memory pressure.
*/
+ if (zone && non_luf_pages_ok(zone))
+ return !(preemptible() && in_task());
+
return !(!irqs_disabled() && in_task());
}
/*
* Can be called with zone lock released and irq enabled.
*/
-bool luf_takeoff_start(void)
+bool luf_takeoff_start(struct zone *zone)
{
unsigned long flags;
- bool no_shootdown = no_shootdown_context();
+ bool no_shootdown = no_shootdown_context(zone);
local_irq_save(flags);
@@ -2591,7 +2594,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* luf_takeoff_{start,end}() is required for
* get_page_from_free_area() to use luf_takeoff_check().
*/
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2796,7 +2799,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -3422,7 +3425,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3567,7 +3570,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -5190,7 +5193,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -7294,7 +7297,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -7412,7 +7415,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c34659b58ca6c..f4055c0a2ea89 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -211,7 +211,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index b23d3ed34ec07..83b66e7f0d257 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -170,7 +170,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (free_area_empty(area, mt))
return err;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(zone);
spin_lock_irq(&zone->lock);
/*
@@ -250,7 +250,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
--
2.17.1
* [RFC PATCH v12 based on v6.14-rc4 23/25] mm/migrate: apply luf mechanism to unmapping during migration
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (20 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 25/25] mm/luf: implement luf debug feature Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers the tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
Apply the mechanism to unmapping during migration.
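As a minimal sketch, not kernel code, the ordering the patch below relies
on in migrate_pages_batch() looks roughly like this: folios whose mappings
were all read-only go onto a separate list, the pending luf batch is
folded into a luf_key before try_to_unmap_flush(), and that key is handed
down the free path so the flush can be done lazily. The helpers here are
simplified stand-ins for the ones named in the diff.

#include <stdio.h>

/* stand-ins for the real helpers used in the diff */
static unsigned short fold_unmap_luf(void) { return 42; /* pretend luf_key */ }
static void try_to_unmap_flush(void)       { puts("flush the non-luf batch"); }

static void move_folios(const char *list, unsigned short luf_key)
{
        printf("move %s with luf_key=%u\n", list, luf_key);
}

int main(void)
{
        /* fold first, so the luf'd entries are not flushed eagerly below */
        unsigned short luf_key = fold_unmap_luf();

        try_to_unmap_flush();

        move_folios("unmap_folios", 0);           /* had writable mappings */
        move_folios("unmap_folios_luf", luf_key); /* read-only: flush lazily */
        return 0;
}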
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 2 ++
include/linux/rmap.h | 2 +-
mm/migrate.c | 66 ++++++++++++++++++++++++++++++++++----------
mm/rmap.c | 15 ++++++----
mm/swap.c | 2 +-
5 files changed, 64 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e8e6562abc77d..1577bc8b743fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1489,6 +1489,8 @@ static inline void folio_put(struct folio *folio)
__folio_put(folio);
}
+void page_cache_release(struct folio *folio);
+
/**
* folio_put_refs - Reduce the reference count on a folio.
* @folio: The folio.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a04088f3f2..cedba4812ccc7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -660,7 +660,7 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
-void try_to_migrate(struct folio *folio, enum ttu_flags flags);
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
void try_to_unmap(struct folio *, enum ttu_flags flags);
int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c89..7ce4d3dbcb1af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1164,7 +1164,8 @@ static void migrate_folio_undo_dst(struct folio *dst, bool locked,
/* Cleanup src folio upon migration success */
static void migrate_folio_done(struct folio *src,
- enum migrate_reason reason)
+ enum migrate_reason reason,
+ unsigned short luf_key)
{
/*
* Compaction can migrate also non-LRU pages which are
@@ -1175,16 +1176,31 @@ static void migrate_folio_done(struct folio *src,
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
- if (reason != MR_MEMORY_FAILURE)
- /* We release the page in page_handle_poison. */
+ /* We release the page in page_handle_poison. */
+ if (reason == MR_MEMORY_FAILURE)
+ luf_flush(luf_key);
+ else if (!luf_key)
folio_put(src);
+ else {
+ /*
+ * Should be the last reference.
+ */
+ if (unlikely(!folio_put_testzero(src)))
+ VM_WARN_ON(1);
+
+ page_cache_release(src);
+ folio_unqueue_deferred_split(src);
+ mem_cgroup_uncharge(src);
+ free_frozen_pages(&src->page, folio_order(src), luf_key);
+ }
}
/* Obtain the lock on page, remove all ptes. */
static int migrate_folio_unmap(new_folio_t get_new_folio,
free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio **dstp, enum migrate_mode mode,
- enum migrate_reason reason, struct list_head *ret)
+ enum migrate_reason reason, struct list_head *ret,
+ bool *can_luf)
{
struct folio *dst;
int rc = -EAGAIN;
@@ -1200,7 +1216,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
folio_clear_unevictable(src);
/* free_pages_prepare() will clear PG_isolated. */
list_del(&src->lru);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, 0);
return MIGRATEPAGE_SUCCESS;
}
@@ -1317,7 +1333,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
/* Establish migration ptes */
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
- try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
+ *can_luf = try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
old_page_state |= PAGE_WAS_MAPPED;
}
@@ -1345,7 +1361,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio *dst,
enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
+ struct list_head *ret, unsigned short luf_key)
{
int rc;
int old_page_state = 0;
@@ -1399,7 +1415,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
if (anon_vma)
put_anon_vma(anon_vma);
folio_unlock(src);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, luf_key);
return rc;
out:
@@ -1694,7 +1710,7 @@ static void migrate_folios_move(struct list_head *src_folios,
struct list_head *ret_folios,
struct migrate_pages_stats *stats,
int *retry, int *thp_retry, int *nr_failed,
- int *nr_retry_pages)
+ int *nr_retry_pages, unsigned short luf_key)
{
struct folio *folio, *folio2, *dst, *dst2;
bool is_thp;
@@ -1711,7 +1727,7 @@ static void migrate_folios_move(struct list_head *src_folios,
rc = migrate_folio_move(put_new_folio, private,
folio, dst, mode,
- reason, ret_folios);
+ reason, ret_folios, luf_key);
/*
* The rules are:
* Success: folio will be freed
@@ -1788,7 +1804,11 @@ static int migrate_pages_batch(struct list_head *from,
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
+ LIST_HEAD(unmap_folios_luf);
+ LIST_HEAD(dst_folios_luf);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ unsigned short luf_key;
+ bool can_luf;
VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));
@@ -1863,9 +1883,11 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ can_luf = false;
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
- ret_folios);
+ ret_folios, &can_luf);
+
/*
* The rules are:
* Success: folio will be freed
@@ -1911,7 +1933,8 @@ static int migrate_pages_batch(struct list_head *from,
/* nr_failed isn't updated for not used */
stats->nr_thp_failed += thp_retry;
rc_saved = rc;
- if (list_empty(&unmap_folios))
+ if (list_empty(&unmap_folios) &&
+ list_empty(&unmap_folios_luf))
goto out;
else
goto move;
@@ -1925,8 +1948,13 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_succeeded += is_thp;
break;
case MIGRATEPAGE_UNMAP:
- list_move_tail(&folio->lru, &unmap_folios);
- list_add_tail(&dst->lru, &dst_folios);
+ if (can_luf) {
+ list_move_tail(&folio->lru, &unmap_folios_luf);
+ list_add_tail(&dst->lru, &dst_folios_luf);
+ } else {
+ list_move_tail(&folio->lru, &unmap_folios);
+ list_add_tail(&dst->lru, &dst_folios);
+ }
break;
default:
/*
@@ -1946,6 +1974,8 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /* Should be before try_to_unmap_flush() */
+ luf_key = fold_unmap_luf();
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();
@@ -1959,7 +1989,11 @@ static int migrate_pages_batch(struct list_head *from,
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
- &nr_failed, &nr_retry_pages);
+ &nr_failed, &nr_retry_pages, 0);
+ migrate_folios_move(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages, luf_key);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1970,6 +2004,8 @@ static int migrate_pages_batch(struct list_head *from,
/* Cleanup remaining folios */
migrate_folios_undo(&unmap_folios, &dst_folios,
put_new_folio, private, ret_folios);
+ migrate_folios_undo(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, ret_folios);
return rc;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index b6613b48669ac..284fc48aef2de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2750,8 +2750,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*
* Tries to remove all the page table entries which are mapping this folio and
* replace them with special swap entries. Caller must hold the folio lock.
+ * Return true if all the mappings are read-only, otherwise false.
*/
-void try_to_migrate(struct folio *folio, enum ttu_flags flags)
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_migrate_one,
@@ -2769,11 +2770,11 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
*/
if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
TTU_SYNC | TTU_BATCH_FLUSH)))
- return;
+ return false;
if (folio_is_zone_device(folio) &&
(!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
- return;
+ return false;
/*
* During exec, a temporary VMA is setup and later moved.
@@ -2793,10 +2794,12 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/swap.c b/mm/swap.c
index 0c6198e4a8ee4..e322670c30041 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -84,7 +84,7 @@ static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
* This path almost never happens for VM activity - pages are normally freed
* in batches. But it gets used by networking - and for compound pages.
*/
-static void page_cache_release(struct folio *folio)
+void page_cache_release(struct folio *folio)
{
struct lruvec *lruvec = NULL;
unsigned long flags;
--
2.17.1
* [RFC PATCH v12 based on v6.14-rc4 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (21 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 23/25] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 25/25] mm/luf: implement luf debug feature Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers the tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
Apply the mechanism to unmapping during folio reclaim.
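As a rough sketch under the same assumptions (not kernel code), the
reclaim path below sorts folios into two free batches based on the
can_luf result of try_to_unmap() and finalizes them together; the helper
names mirror the diff but the bodies are placeholders.

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the helpers named in the diff */
static unsigned short fold_unmap_luf(void) { return 7; /* pretend luf_key */ }
static void try_to_unmap_flush(void)       { puts("eager tlb flush"); }
static void free_unref_folios(const char *batch, int nr, unsigned short luf_key)
{
        printf("free %d folio(s) from %s with luf_key=%u\n", nr, batch, luf_key);
}

int main(void)
{
        /* per-folio result of try_to_unmap(): true if all mappings were read-only */
        bool can_luf[] = { true, false, true };
        int nr_luf = 0, nr_plain = 0;

        for (int i = 0; i < 3; i++)
                can_luf[i] ? nr_luf++ : nr_plain++;

        /* finalize: fold into a luf_key before the eager flush, then free both */
        unsigned short luf_key = fold_unmap_luf();
        try_to_unmap_flush();
        free_unref_folios("free_folios", nr_plain, 0);
        free_unref_folios("free_folios_luf", nr_luf, luf_key);
        return 0;
}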
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/rmap.h | 5 +++--
mm/rmap.c | 11 +++++++----
mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
3 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cedba4812ccc7..854b41441d466 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -661,7 +661,7 @@ int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
-void try_to_unmap(struct folio *, enum ttu_flags flags);
+bool try_to_unmap(struct folio *, enum ttu_flags flags);
int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct page **pages,
@@ -794,8 +794,9 @@ static inline int folio_referenced(struct folio *folio, int is_locked,
return 0;
}
-static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+static inline bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
+ return false;
}
static inline int folio_mkclean(struct folio *folio)
diff --git a/mm/rmap.c b/mm/rmap.c
index 284fc48aef2de..df350b4dfddd0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2386,10 +2386,11 @@ static int folio_not_mapped(struct folio *folio)
* Tries to remove all the page table entries which are mapping this
* folio. It is the caller's responsibility to check if the folio is
* still mapped if needed (use TTU_SYNC to prevent accounting races).
+ * Return true if all the mappings are read-only, otherwise false.
*
* Context: Caller must hold the folio lock.
*/
-void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
@@ -2408,10 +2409,12 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a31a7cf87315f..065b40f36bbdd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1092,14 +1092,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct reclaim_stat *stat, bool ignore_references)
{
struct folio_batch free_folios;
+ struct folio_batch free_folios_luf;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
unsigned int nr_reclaimed = 0, nr_demoted = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
struct swap_iocb *plug = NULL;
+ unsigned short luf_key;
folio_batch_init(&free_folios);
+ folio_batch_init(&free_folios_luf);
memset(stat, 0, sizeof(*stat));
cond_resched();
do_demote_pass = can_demote(pgdat->node_id, sc);
@@ -1111,6 +1114,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool can_luf = false;
cond_resched();
@@ -1344,7 +1348,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_test_large(folio))
flags |= TTU_SYNC;
- try_to_unmap(folio, flags);
+ can_luf = try_to_unmap(folio, flags);
if (folio_mapped(folio)) {
stat->nr_unmap_fail += nr_pages;
if (!was_swapbacked &&
@@ -1488,6 +1492,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* leave it off the LRU).
*/
nr_reclaimed += nr_pages;
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
continue;
}
}
@@ -1520,6 +1526,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
nr_reclaimed += nr_pages;
folio_unqueue_deferred_split(folio);
+
+ if (can_luf) {
+ if (folio_batch_add(&free_folios_luf, folio) == 0) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+ }
+ continue;
+ }
+
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1554,9 +1573,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
}
/* 'folio_list' is always empty here */
+ /*
+ * Finalize this turn before demote_folio_list().
+ */
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+
/* Migrate folios selected for demotion */
nr_demoted = demote_folio_list(&demote_folios, pgdat);
nr_reclaimed += nr_demoted;
@@ -1590,10 +1621,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
- mem_cgroup_uncharge_folios(&free_folios);
- try_to_unmap_flush();
- free_unref_folios(&free_folios, 0);
-
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
--
2.17.1
* [RFC PATCH v12 based on v6.14-rc4 25/25] mm/luf: implement luf debug feature
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (22 preceding siblings ...)
2025-02-26 12:03 ` [RFC PATCH v12 based on v6.14-rc4 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
@ 2025-02-26 12:03 ` Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:03 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
We need a luf debug feature to detect when luf goes wrong by any chance.
As an RFC, this suggests a simple implementation that reports problematic
situations caused by luf.
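A minimal standalone model of the idea, not the actual implementation:
keep a per-page shadow record of the tlb flush still owed (the patch
stores a struct luf_batch in page_ext for this), mark it when a page is
freed with a luf_key, and warn if the page is handed out for write or
kmap before the flush has been observed. The array and the plain ugen
comparison below are simplified stand-ins; the real code compares
generations with wraparound handling.

#include <stdio.h>

#define NR_PAGES 4

static unsigned long pending_ugen[NR_PAGES];    /* 0 means no flush owed */
static unsigned long flushed_ugen;              /* "flush done up to" marker */

/* called when a page is freed with a pending luf batch */
static void lufd_mark_page(int pfn, unsigned long ugen)
{
        pending_ugen[pfn] = ugen;
}

/* called from places that are about to use the page's contents writably */
static void lufd_check_page(int pfn)
{
        if (pending_ugen[pfn] > flushed_ugen)
                printf("LUFD: page %d used before its tlb flush (ugen %lu > done %lu)\n",
                       pfn, pending_ugen[pfn], flushed_ugen);
}

int main(void)
{
        lufd_mark_page(1, 10);          /* freed while a luf batch is pending */
        lufd_check_page(1);             /* too early: reports the problem */
        flushed_ugen = 10;              /* the shootdown finally happened */
        lufd_check_page(1);             /* fine now: stays silent */
        return 0;
}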
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 3 +
arch/riscv/mm/tlbflush.c | 35 ++++-
arch/x86/include/asm/pgtable.h | 10 ++
arch/x86/include/asm/tlbflush.h | 3 +
arch/x86/mm/pgtable.c | 10 ++
arch/x86/mm/tlb.c | 35 ++++-
include/linux/highmem-internal.h | 5 +
include/linux/mm.h | 20 ++-
include/linux/mm_types.h | 16 +--
include/linux/mm_types_task.h | 16 +++
include/linux/sched.h | 5 +
mm/highmem.c | 1 +
mm/memory.c | 12 ++
mm/page_alloc.c | 34 ++++-
mm/page_ext.c | 3 +
mm/rmap.c | 229 ++++++++++++++++++++++++++++++
16 files changed, 418 insertions(+), 19 deletions(-)
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index ec5caeb3cf8ef..9451f3d22f229 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -69,6 +69,9 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned
bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 93afb7a299003..de91bfe0426c2 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -216,6 +216,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -223,17 +242,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -243,10 +267,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -254,6 +280,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -262,10 +289,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -283,15 +312,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -309,4 +341,5 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 593f10aabd45a..414bcabb23b51 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -695,12 +695,22 @@ static inline pud_t pud_mkyoung(pud_t pud)
return pud_set_flags(pud, _PAGE_ACCESSED);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud);
+static inline pud_t __pud_mkwrite(pud_t pud)
+{
+ pud = pud_set_flags(pud, _PAGE_RW);
+
+ return pud_clear_saveddirty(pud);
+}
+#else
static inline pud_t pud_mkwrite(pud_t pud)
{
pud = pud_set_flags(pud, _PAGE_RW);
return pud_clear_saveddirty(pud);
}
+#endif
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline int pte_soft_dirty(pte_t pte)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dbcbf0477ed2a..03b3e90186ab1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -298,6 +298,9 @@ extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, un
extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1fef5ad32d5a8..d0b7a1437214c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -904,6 +904,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
+ lufd_check_pages(pte_page(pte), 0);
if (vma->vm_flags & VM_SHADOW_STACK)
return pte_mkwrite_shstk(pte);
@@ -914,6 +915,7 @@ pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
+ lufd_check_pages(pmd_page(pmd), PMD_ORDER);
if (vma->vm_flags & VM_SHADOW_STACK)
return pmd_mkwrite_shstk(pmd);
@@ -922,6 +924,14 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd_clear_saveddirty(pmd);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud)
+{
+ lufd_check_pages(pud_page(pud), PUD_ORDER);
+ return __pud_mkwrite(pud);
+}
+#endif
+
void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
{
/*
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index be6068b60c32d..99b3d54aa74d2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1283,6 +1283,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -1290,17 +1309,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -1310,10 +1334,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -1321,6 +1347,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -1329,10 +1356,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1350,15 +1379,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1376,6 +1408,7 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index dd100e849f5e0..0792530d1be7b 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -41,6 +41,7 @@ static inline void *kmap(struct page *page)
{
void *addr;
+ lufd_check_pages(page, 0);
might_sleep();
if (!PageHighMem(page))
addr = page_address(page);
@@ -161,6 +162,7 @@ static inline struct page *kmap_to_page(void *addr)
static inline void *kmap(struct page *page)
{
+ lufd_check_pages(page, 0);
might_sleep();
return page_address(page);
}
@@ -177,11 +179,13 @@ static inline void kunmap(struct page *page)
static inline void *kmap_local_page(struct page *page)
{
+ lufd_check_pages(page, 0);
return page_address(page);
}
static inline void *kmap_local_folio(struct folio *folio, size_t offset)
{
+ lufd_check_folio(folio);
return page_address(&folio->page) + offset;
}
@@ -204,6 +208,7 @@ static inline void __kunmap_local(const void *addr)
static inline void *kmap_atomic(struct page *page)
{
+ lufd_check_pages(page, 0);
if (IS_ENABLED(CONFIG_PREEMPT_RT))
migrate_disable();
else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1577bc8b743fe..5e577d5fba130 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -45,6 +45,24 @@ extern int sysctl_page_lock_unfairness;
void mm_core_init(void);
void init_mm_internals(void);
+#ifdef CONFIG_LUF_DEBUG
+void lufd_check_folio(struct folio *f);
+void lufd_check_pages(const struct page *p, unsigned int order);
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order);
+void lufd_check_queued_pages(void);
+void lufd_queue_page_for_check(struct page *page, int order);
+void lufd_mark_folio(struct folio *f, unsigned short luf_key);
+void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key);
+#else
+static inline void lufd_check_folio(struct folio *f) {}
+static inline void lufd_check_pages(const struct page *p, unsigned int order) {}
+static inline void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order) {}
+static inline void lufd_check_queued_pages(void) {}
+static inline void lufd_queue_page_for_check(struct page *page, int order) {}
+static inline void lufd_mark_folio(struct folio *f, unsigned short luf_key) {}
+static inline void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key) {}
+#endif
+
#ifndef CONFIG_NUMA /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -114,7 +132,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
#endif
#ifndef page_to_virt
-#define page_to_virt(x) __va(PFN_PHYS(page_to_pfn(x)))
+#define page_to_virt(x) ({ lufd_check_pages(x, 0); __va(PFN_PHYS(page_to_pfn(x)));})
#endif
#ifndef lm_alias
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c5f44b5c9758f..0cd83c1c231b9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,10 @@
#include <asm/mmu.h>
+#ifdef CONFIG_LUF_DEBUG
+extern struct page_ext_operations luf_debug_ops;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -32,18 +36,6 @@
struct address_space;
struct mem_cgroup;
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-struct luf_batch {
- struct tlbflush_unmap_batch batch;
- unsigned long ugen;
- rwlock_t lock;
-};
-void luf_batch_init(struct luf_batch *lb);
-#else
-struct luf_batch {};
-static inline void luf_batch_init(struct luf_batch *lb) {}
-#endif
-
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba46..3b87f8674e528 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -10,6 +10,7 @@
#include <linux/align.h>
#include <linux/types.h>
+#include <linux/spinlock_types.h>
#include <asm/page.h>
@@ -88,4 +89,19 @@ struct tlbflush_unmap_batch {
#endif
};
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+void luf_batch_init(struct luf_batch *lb);
+#else
+struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
+#endif
+
+#if defined(CONFIG_LUF_DEBUG)
+#define NR_LUFD_PAGES 512
+#endif
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 96375274d0335..9cb8e6fa1b1b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,11 @@ struct task_struct {
unsigned long luf_ugen;
unsigned long zone_ugen;
unsigned long wait_zone_ugen;
+#if defined(CONFIG_LUF_DEBUG)
+ struct page *lufd_pages[NR_LUFD_PAGES];
+ int lufd_pages_order[NR_LUFD_PAGES];
+ int lufd_pages_nr;
+#endif
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/highmem.c b/mm/highmem.c
index ef3189b36cadb..a323d5a655bf9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -576,6 +576,7 @@ void *__kmap_local_page_prot(struct page *page, pgprot_t prot)
{
void *kmap;
+ lufd_check_pages(page, 0);
/*
* To broaden the usage of the actual kmap_local() machinery always map
* pages when debugging is enabled and the architecture has no problems
diff --git a/mm/memory.c b/mm/memory.c
index 6cdc1df0424f3..e7a0a89d7027e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6224,6 +6224,18 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
mapping = vma->vm_file->f_mapping;
}
+#ifdef CONFIG_LUF_DEBUG
+ if (luf_flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * in the struct address_space should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush_mapping(mapping);
+ luf_flush_mm(mm);
+ }
+#endif
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a2103df2d88e..9258d7c4eaf42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -758,6 +758,8 @@ void luf_takeoff_end(struct zone *zone)
VM_WARN_ON(current->zone_ugen);
VM_WARN_ON(current->wait_zone_ugen);
}
+
+ lufd_check_queued_pages();
}
/*
@@ -853,8 +855,10 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
struct luf_batch *lb;
unsigned long lb_ugen;
- if (!luf_key)
+ if (!luf_key) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
@@ -875,12 +879,15 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
current->luf_ugen = lb_ugen;
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
zone_ugen = page_zone_ugen(zone, page);
- if (!zone_ugen)
+ if (!zone_ugen) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
/*
* Should not be zero since zone-zone_ugen has been updated in
@@ -888,17 +895,23 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
*/
VM_WARN_ON(!zone->zone_ugen);
- if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen)) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
if (current->luf_no_shootdown)
return false;
+ lufd_check_zone_pages(zone, page, buddy_order(page));
+
/*
* zone batched flush has been already set.
*/
- if (current->zone_ugen)
+ if (current->zone_ugen) {
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
+ }
/*
* Others are already performing tlb shootdown for us. All we
@@ -933,6 +946,7 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
atomic_long_set(&zone->nr_luf_pages, 0);
fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
}
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
#endif
@@ -1238,6 +1252,11 @@ static inline void __free_one_page(struct page *page,
} else
zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
unsigned long buddy_zone_ugen;
@@ -1299,6 +1318,10 @@ static inline void __free_one_page(struct page *page,
set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
}
done_merging:
@@ -3168,6 +3191,8 @@ void free_frozen_pages(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ lufd_mark_pages(page, order, luf_key);
+
if (!pcp_allowed_order(order)) {
__free_pages_ok(page, order, FPI_NONE, luf_key);
return;
@@ -3220,6 +3245,7 @@ void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
+ lufd_mark_folio(folio, luf_key);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c1..be40bc2a93378 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -89,6 +89,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
+#ifdef CONFIG_LUF_DEBUG
+ &luf_debug_ops,
+#endif
};
unsigned long page_ext_size;
diff --git a/mm/rmap.c b/mm/rmap.c
index df350b4dfddd0..6a6188d47031b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1161,6 +1161,235 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+#ifdef CONFIG_LUF_DEBUG
+
+static bool need_luf_debug(void)
+{
+ return true;
+}
+
+static void init_luf_debug(void)
+{
+ /* Do nothing */
+}
+
+struct page_ext_operations luf_debug_ops = {
+ .size = sizeof(struct luf_batch),
+ .need = need_luf_debug,
+ .init = init_luf_debug,
+ .need_shared_flags = false,
+};
+
+static bool __lufd_check_zone_pages(struct page *page, int nr,
+ struct tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_done(&lb->batch.arch, &batch->arch);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret || ugen_before(ugen, lb_ugen))
+ return false;
+ }
+ return true;
+}
+
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page || !zone)
+ return;
+
+ warn = !__lufd_check_zone_pages(page, 1 << order,
+ &zone->zone_batch, zone->luf_ugen);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+
+static bool __lufd_check_pages(const struct page *page, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_diet(&lb->batch.arch, lb_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret)
+ return false;
+ }
+ return true;
+}
+
+void lufd_queue_page_for_check(struct page *page, int order)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+
+ if (!page)
+ return;
+
+ if (current->lufd_pages_nr >= NR_LUFD_PAGES) {
+ VM_WARN_ONCE(1, "LUFD: NR_LUFD_PAGES is too small.\n");
+ return;
+ }
+
+ *(parray + current->lufd_pages_nr) = page;
+ *(oarray + current->lufd_pages_nr) = order;
+ current->lufd_pages_nr++;
+}
+
+void lufd_check_queued_pages(void)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+ int i;
+
+ for (i = 0; i < current->lufd_pages_nr; i++)
+ lufd_check_pages(*(parray + i), *(oarray + i));
+ current->lufd_pages_nr = 0;
+}
+
+void lufd_check_folio(struct folio *folio)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!folio)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_folio);
+
+void lufd_check_pages(const struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_pages);
+
+static void __lufd_mark_pages(struct page *page, int nr, unsigned short luf_key)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ fold_luf_batch(lb, &luf_batch[luf_key]);
+ page_ext_put(page_ext);
+ }
+}
+
+void lufd_mark_folio(struct folio *folio, unsigned short luf_key)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+ __lufd_mark_pages(page, nr, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+
+void lufd_mark_pages(struct page *page, unsigned int order, unsigned short luf_key)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+ __lufd_mark_pages(page, 1 << order, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+#endif
+
/**
* page_address_in_vma - The virtual address of a page in this VMA.
* @folio: The folio containing the page.
--
2.17.1
* RFC v12 rebased on mm-unstable as of Feb 21, 2025
2025-02-20 23:37 ` Byungchul Park
2025-02-26 11:30 ` RFC v12 rebased on v6.14-rc4 Byungchul Park
@ 2025-02-26 11:33 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
1 sibling, 1 reply; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 11:33 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Dave Hansen, linux-kernel, linux-mm, kernel_team, akpm,
ying.huang, vernhao, mgorman, hughd, willy, david, peterz, luto,
tglx, mingo, bp, rjgolo
On Fri, Feb 21, 2025 at 08:37:10AM +0900, Byungchul Park wrote:
> On Thu, Feb 20, 2025 at 04:29:51PM +0100, Vlastimil Babka wrote:
> > On 2/20/25 16:15, Dave Hansen wrote:
> > > On 2/19/25 21:20, Byungchul Park wrote:
> > >> I'm posting the latest version so that anyone can try luf mechanism if
> > >> wanted by any chance. However, I tagged RFC again because there are
> > >> still issues that should be resolved to merge to mainline:
> > >
> > > I don't see anything fundamentally different here from the last 11
> > > versions. I think the entire approach is dangerous and basically makes
> > > things impossible to debug. It's not clear that some of the failure
> > > scenarios that I've brought up in the past have actually been fixed.
> >
> > Yes, and it's still an invasive change to the buddy allocator.
>
> Didn't want.. but admit.
>
> > IIRC at Plumbers the opinion in the audience was that there might be ways to
> > improve the batching on unmap to reduce the flushes without such an invasive
> > and potentially dangerous change? Has that been investigated?
>
> Sure. I tried that, for example by holding those pages unfreed until
> either no one accesses the pages of interest or memory pressure gets
> high. However, it was unfortunately very hard to fix the performance
> degradation caused by the increased number of page reclaims due to the
> unfreed pages.
>
> > Also "Rebase on akpm/mm.git mm-unstable(5a7056135b) as of Nov 22, 2024." is
> > very outdated at this point?
>
> Sorry for that. I will rebase and share.
This is the same patch set but rebased on akpm/mm.git
mm-unstable(f7ed46277aa) as of Feb 21, 2025.
Byungchul
> Byungchul
> >
> > Thanks,
> > Vlastimil
> >
> > > What I've said here still stands:
> > >
> > >> https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
> > >
> > >> I think tglx would call all of this "tinkering". The approach to this
> > >> series is to "fix" narrow, specific cases that reviewers point out, make
> > >> it compile, then send it out again, hoping someone will apply it.
> > >>
> > >> So, for me, until the approach to this series changes: NAK, for x86.
> > >> Andrew, please don't take this series. Or, if you do, please drop the
> > >> patch enabling it on x86.
> > >
> > > I think I'd also like to stop being cc'd on this. If LUF is merged into
> > > mainline and proven to work on arm64 or riscv for a year, I'd be happy
> > > to take another look at enabling it on x86. I think that's just about
> > > the only thing that would make me reconsider.
> > >
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data
2025-02-26 11:33 ` RFC v12 rebased on mm-unstable as of Feb 21, 2025 Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 02/25] arm64/tlbflush: " Byungchul Park
` (23 more replies)
0 siblings, 24 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios won't change while they stay in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that needs to recognize
read-only tlb entries by separating the tlb batch arch data into two,
one for read-only entries and the other for writable ones, and merging
the two when needed.
It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed since then. To support that, add APIs
manipulating the arch data for x86.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/x86/include/asm/tlbflush.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 29373da7b00a6..52c54ca68ca9e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -5,6 +5,7 @@
#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
@@ -293,6 +294,29 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+}
+
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
base-commit: f7ed46277aaa8f848f18959ff68469f5186ba87c
--
2.17.1
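For illustration only, here is a minimal userspace sketch of the set
semantics the four helpers above implement, with a plain unsigned long
standing in for the cpumask; the struct and function names below are
hypothetical stand-ins, not kernel API:

#include <assert.h>
#include <stdbool.h>

struct batch { unsigned long cpus; };	/* bit n set: CPU n still needs a flush */

static void batch_clear(struct batch *b)                 { b->cpus = 0; }
static void batch_fold(struct batch *d, struct batch *s) { d->cpus |= s->cpus; }

/* mirrors arch_tlbbatch_need_fold(): mm runs on a CPU the batch misses */
static bool batch_need_fold(struct batch *b, unsigned long mm_cpus)
{
	return (mm_cpus & ~b->cpus) != 0;
}

/* mirrors arch_tlbbatch_done(): prune CPUs already flushed via 's' */
static bool batch_done(struct batch *d, struct batch *s)
{
	d->cpus &= ~s->cpus;
	return d->cpus == 0;	/* true: nothing left pending */
}

int main(void)
{
	struct batch pending = { .cpus = 0xb };	/* CPUs 0, 1, 3 pending */
	struct batch flushed = { .cpus = 0x3 };	/* CPUs 0, 1 already flushed */

	assert(batch_need_fold(&pending, 0x4));		/* CPU 2 not covered yet */
	assert(!batch_done(&pending, &flushed));	/* CPU 3 still pending */
	batch_fold(&pending, &flushed);
	batch_clear(&pending);
	return 0;
}

The pruning in batch_done() is the point of arch_tlbbatch_done() in this
series: a CPU whose flush has already been covered elsewhere can be
dropped from the pending batch instead of being shot down again.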
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 02/25] arm64/tlbflush: add APIs manipulating tlb batch's arch data
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 03/25] riscv/tlb: " Byungchul Park
` (22 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while they stay in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that requires manipulating the
tlb batch's arch data. Even though arm64 needs to do nothing here, any
arch with CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH should provide the APIs.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index b7e1920570bdd..f7036cd33e35c 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -347,6 +347,33 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ /* nothing to do */
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* nothing to do */
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return false;
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ /* The kernel can consider the tlb batch as always done. */
+ return true;
+}
+
/*
* This is meant to avoid soft lock-ups on large TLB flushing ranges and not
* necessarily a performance improvement.
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 03/25] riscv/tlb: add APIs manipulating tlb batch's arch data
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 02/25] arm64/tlbflush: " Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush() Byungchul Park
` (21 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while they stay in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that needs to recognize
read-only tlb entries by separating the tlb batch arch data into two,
one for read-only entries and the other for writable ones, and merging
the two when needed.
It also optimizes tlb shootdown by skipping CPUs that have already
performed the tlb flush needed since then. To support that, add APIs
manipulating the arch data for riscv.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index ce0dd0fed7646..cecd8e7e2a3bd 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -8,6 +8,7 @@
#define _ASM_RISCV_TLBFLUSH_H
#include <linux/mm_types.h>
+#include <linux/cpumask.h>
#include <asm/smp.h>
#include <asm/errata_list.h>
@@ -64,6 +65,33 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
+{
+ cpumask_clear(&batch->cpumask);
+
+}
+
+static inline void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_need_fold(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm)
+{
+ return !cpumask_subset(mm_cpumask(mm), &batch->cpumask);
+
+}
+
+static inline bool arch_tlbbatch_done(struct arch_tlbflush_unmap_batch *bdst,
+ struct arch_tlbflush_unmap_batch *bsrc)
+{
+ return !cpumask_andnot(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask);
+
+}
+
extern unsigned long tlb_flush_all_threshold;
#else /* CONFIG_MMU */
#define local_flush_tlb_all() do { } while (0)
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush()
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 02/25] arm64/tlbflush: " Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 03/25] riscv/tlb: " Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 05/25] mm/buddy: make room for a new variable, luf_key, in struct page Byungchul Park
` (20 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while they stay in pcp or buddy,
so we can still read the data through the stale tlb entries.
This is a preparation for the mechanism that requires avoiding redundant
tlb flushes by manipulating the tlb batch's arch data. To achieve that,
we need to separate the part that clears the tlb batch's arch data out
of arch_tlbbatch_flush().
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/mm/tlbflush.c | 1 -
arch/x86/mm/tlb.c | 2 --
mm/rmap.c | 1 +
3 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 74dd9307fbf1b..38f4bea8a964a 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -200,5 +200,4 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
- cpumask_clear(&batch->cpumask);
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bbe..523e8bb6fba1f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1292,8 +1292,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
- cpumask_clear(&batch->cpumask);
-
put_flush_tlb_info();
put_cpu();
}
diff --git a/mm/rmap.c b/mm/rmap.c
index bcec8677f68df..546b7a6a30a44 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -648,6 +648,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
+ arch_tlbbatch_clear(&tlb_ubc->arch);
tlb_ubc->flush_required = false;
tlb_ubc->writable = false;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 05/25] mm/buddy: make room for a new variable, luf_key, in struct page
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (2 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 04/25] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush() Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 06/25] mm: move should_skip_kasan_poison() to mm/internal.h Byungchul Park
` (19 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that tracks the need of tlb flush for each page residing in buddy.
Since the private field in struct page is used in buddy only to store
the page order, which ranges from 0 to MAX_PAGE_ORDER and thus fits in
an unsigned short, split it into two smaller fields, order and luf_key,
so that both can be used in buddy at the same time.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm_types.h | 42 +++++++++++++++++++++++++++++++++-------
mm/internal.h | 4 ++--
mm/page_alloc.c | 2 +-
3 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 689b2a7461892..7b15efbe9f529 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -107,13 +107,27 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
- /**
- * @private: Mapping-private opaque data.
- * Usually used for buffer_heads if PagePrivate.
- * Used for swp_entry_t if swapcache flag set.
- * Indicates order in the buddy system if PageBuddy.
- */
- unsigned long private;
+ union {
+ /**
+ * @private: Mapping-private opaque data.
+ * Usually used for buffer_heads if PagePrivate.
+ * Used for swp_entry_t if swapcache flag set.
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned long private;
+ struct {
+ /*
+ * Indicates order in the buddy system if PageBuddy.
+ */
+ unsigned short order;
+
+ /*
+ * For tracking need of tlb flush,
+ * by luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+ };
+ };
};
struct { /* page_pool used by netstack */
/**
@@ -577,6 +591,20 @@ static inline void set_page_private(struct page *page, unsigned long private)
page->private = private;
}
+#define page_buddy_order(page) ((page)->order)
+
+static inline void set_page_buddy_order(struct page *page, unsigned int order)
+{
+ page->order = (unsigned short)order;
+}
+
+#define page_luf_key(page) ((page)->luf_key)
+
+static inline void set_page_luf_key(struct page *page, unsigned short luf_key)
+{
+ page->luf_key = luf_key;
+}
+
static inline void *folio_get_private(struct folio *folio)
{
return folio->private;
diff --git a/mm/internal.h b/mm/internal.h
index b07550db2bfd1..c4d2018a7cf8e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -543,7 +543,7 @@ struct alloc_context {
static inline unsigned int buddy_order(struct page *page)
{
/* PageBuddy() must be checked by the caller */
- return page_private(page);
+ return page_buddy_order(page);
}
/*
@@ -557,7 +557,7 @@ static inline unsigned int buddy_order(struct page *page)
* times, potentially observing different values in the tests and the actual
* use of the result.
*/
-#define buddy_order_unsafe(page) READ_ONCE(page_private(page))
+#define buddy_order_unsafe(page) READ_ONCE(page_buddy_order(page))
/*
* This function checks whether a page is free && is the buddy
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 16dfcf7ade74a..86c9fa45d36fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -576,7 +576,7 @@ void prep_compound_page(struct page *page, unsigned int order)
static inline void set_buddy_order(struct page *page, unsigned int order)
{
- set_page_private(page, order);
+ set_page_buddy_order(page, order);
__SetPageBuddy(page);
}
--
2.17.1
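As a quick sanity sketch (C11, not part of the patch, and assuming the
default MAX_PAGE_ORDER of 10; the real value is config dependent), the
order values buddy needs to store fit comfortably in the unsigned short
half, and the pair of shorts does not outgrow the unsigned long private
slot it shares a union with:

#include <assert.h>

#define MAX_PAGE_ORDER 10	/* assumed default; config dependent in reality */

struct fake_slot {		/* hypothetical stand-in for the union above */
	union {
		unsigned long private;
		struct {
			unsigned short order;
			unsigned short luf_key;
		};
	};
};

int main(void)
{
	static_assert(MAX_PAGE_ORDER <= 0xffff,
		      "buddy order must fit in unsigned short");
	static_assert(sizeof(struct fake_slot) == sizeof(unsigned long),
		      "the union must not grow the private slot");
	return 0;
}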
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 06/25] mm: move should_skip_kasan_poison() to mm/internal.h
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (3 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 05/25] mm/buddy: make room for a new variable, luf_key, in struct page Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 07/25] mm: introduce luf_ugen to be used as a global timestamp Byungchul Park
` (18 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to use the should_skip_kasan_poison() function, so make it
available via mm/internal.h.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 47 -----------------------------------------------
2 files changed, 47 insertions(+), 47 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index c4d2018a7cf8e..ee8af97c39f59 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1067,8 +1067,55 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
DECLARE_STATIC_KEY_TRUE(deferred_pages);
bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
+
+static inline bool deferred_pages_enabled(void)
+{
+ return static_branch_unlikely(&deferred_pages);
+}
+#else
+static inline bool deferred_pages_enabled(void)
+{
+ return false;
+}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+/*
+ * Skip KASAN memory poisoning when either:
+ *
+ * 1. For generic KASAN: deferred memory initialization has not yet completed.
+ * Tag-based KASAN modes skip pages freed via deferred memory initialization
+ * using page tags instead (see below).
+ * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
+ * that error detection is disabled for accesses via the page address.
+ *
+ * Pages will have match-all tags in the following circumstances:
+ *
+ * 1. Pages are being initialized for the first time, including during deferred
+ * memory init; see the call to page_kasan_tag_reset in __init_single_page.
+ * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
+ * exception of pages unpoisoned by kasan_unpoison_vmalloc.
+ * 3. The allocation was excluded from being checked due to sampling,
+ * see the call to kasan_unpoison_pages.
+ *
+ * Poisoning pages during deferred memory init will greatly lengthen the
+ * process and cause problem in large memory systems as the deferred pages
+ * initialization is done with interrupt disabled.
+ *
+ * Assuming that there will be no reference to those newly initialized
+ * pages before they are ever allocated, this should have no effect on
+ * KASAN memory tracking as the poison will be properly inserted at page
+ * allocation time. The only corner case is when pages are allocated by
+ * on-demand allocation and then freed again before the deferred pages
+ * initialization is done, but this is not likely to happen.
+ */
+static inline bool should_skip_kasan_poison(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_KASAN_GENERIC))
+ return deferred_pages_enabled();
+
+ return page_kasan_tag(page) == KASAN_TAG_KERNEL;
+}
+
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86c9fa45d36fe..f3930a2a05cd3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -299,11 +299,6 @@ int page_group_by_mobility_disabled __read_mostly;
*/
DEFINE_STATIC_KEY_TRUE(deferred_pages);
-static inline bool deferred_pages_enabled(void)
-{
- return static_branch_unlikely(&deferred_pages);
-}
-
/*
* deferred_grow_zone() is __init, but it is called from
* get_page_from_freelist() during early boot until deferred_pages permanently
@@ -316,11 +311,6 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
return deferred_grow_zone(zone, order);
}
#else
-static inline bool deferred_pages_enabled(void)
-{
- return false;
-}
-
static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order)
{
return false;
@@ -993,43 +983,6 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
return ret;
}
-/*
- * Skip KASAN memory poisoning when either:
- *
- * 1. For generic KASAN: deferred memory initialization has not yet completed.
- * Tag-based KASAN modes skip pages freed via deferred memory initialization
- * using page tags instead (see below).
- * 2. For tag-based KASAN modes: the page has a match-all KASAN tag, indicating
- * that error detection is disabled for accesses via the page address.
- *
- * Pages will have match-all tags in the following circumstances:
- *
- * 1. Pages are being initialized for the first time, including during deferred
- * memory init; see the call to page_kasan_tag_reset in __init_single_page.
- * 2. The allocation was not unpoisoned due to __GFP_SKIP_KASAN, with the
- * exception of pages unpoisoned by kasan_unpoison_vmalloc.
- * 3. The allocation was excluded from being checked due to sampling,
- * see the call to kasan_unpoison_pages.
- *
- * Poisoning pages during deferred memory init will greatly lengthen the
- * process and cause problem in large memory systems as the deferred pages
- * initialization is done with interrupt disabled.
- *
- * Assuming that there will be no reference to those newly initialized
- * pages before they are ever allocated, this should have no effect on
- * KASAN memory tracking as the poison will be properly inserted at page
- * allocation time. The only corner case is when pages are allocated by
- * on-demand allocation and then freed again before the deferred pages
- * initialization is done, but this is not likely to happen.
- */
-static inline bool should_skip_kasan_poison(struct page *page)
-{
- if (IS_ENABLED(CONFIG_KASAN_GENERIC))
- return deferred_pages_enabled();
-
- return page_kasan_tag(page) == KASAN_TAG_KERNEL;
-}
-
static void kernel_init_pages(struct page *page, int numpages)
{
int i;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 07/25] mm: introduce luf_ugen to be used as a global timestamp
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (4 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 06/25] mm: move should_skip_kasan_poison() to mm/internal.h Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data Byungchul Park
` (17 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to evaluate the temporal sequence of events to determine
whether the required tlb flush has been done on each CPU.
To achieve that, this patch introduces a generation number, luf_ugen,
and a few APIs manipulating the number. It's worth noting that the
number is designed to wrap around, so care must be taken when using it.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
mm/rmap.c | 22 ++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d82feabbe44f8..74a37cb132caa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4240,4 +4240,38 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+/*
+ * luf_ugen will start with 2 so that 1 can be regarded as a passed one.
+ */
+#define LUF_UGEN_INIT 2
+
+static inline bool ugen_before(unsigned long a, unsigned long b)
+{
+ /*
+ * Consider wraparound.
+ */
+ return (long)(a - b) < 0;
+}
+
+static inline unsigned long next_ugen(unsigned long ugen)
+{
+ if (ugen + 1)
+ return ugen + 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen + 2;
+}
+
+static inline unsigned long prev_ugen(unsigned long ugen)
+{
+ if (ugen - 1)
+ return ugen - 1;
+ /*
+ * Avoid invalid ugen, zero.
+ */
+ return ugen - 2;
+}
+#endif
#endif /* _LINUX_MM_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 546b7a6a30a44..8439dbb194c8c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -634,6 +634,28 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
}
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+/*
+ * This generation number is primarily used as a global timestamp to
+ * determine whether tlb flush required has been done on each CPU. The
+ * function, ugen_before(), should be used to evaluate the temporal
+ * sequence of events because the number is designed to wraparound.
+ */
+static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+
+/*
+ * Don't return invalid luf_ugen, zero.
+ */
+static unsigned long __maybe_unused new_luf_ugen(void)
+{
+ unsigned long ugen = atomic_long_inc_return(&luf_ugen);
+
+ if (!ugen)
+ ugen = atomic_long_inc_return(&luf_ugen);
+
+ return ugen;
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1
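A userspace illustration (not kernel code) of why ugen_before() must be
used instead of a plain '<': generation values are expected to wrap
around, and the signed-difference trick keeps the ordering correct
across the wrap on the usual two's-complement targets the kernel runs
on:

#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static bool ugen_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;	/* signed distance handles wraparound */
}

int main(void)
{
	unsigned long older = ULONG_MAX - 1;	/* just before the counter wraps */
	unsigned long newer = older + 3;	/* wrapped: this is 1 now */

	assert(ugen_before(older, newer));	/* still ordered correctly */
	assert(!(older < newer));		/* a naive compare gets it wrong */
	assert(ugen_before(2, 5) && !ugen_before(5, 2));
	return 0;
}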
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (5 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 07/25] mm: introduce luf_ugen to be used as a global timestamp Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator Byungchul Park
` (16 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to keep luf metadata per page while the page stays in the
pcp or buddy allocator. The metadata includes the cpumask for tlb
shootdown and luf's request generation number.
Since struct page doesn't have enough room to store the luf metadata,
this patch introduces a hash table to store it and makes each page keep
its hash key instead.
Since all the pages in pcp or buddy share the hash table, collisions
are inevitable, so care must be taken when reading or updating its
entries.
---
include/linux/mm_types.h | 10 ++++
mm/internal.h | 8 +++
mm/rmap.c | 122 +++++++++++++++++++++++++++++++++++++--
3 files changed, 136 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7b15efbe9f529..f52d4e49e8736 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -33,6 +33,16 @@
struct address_space;
struct mem_cgroup;
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+#else
+struct luf_batch {};
+#endif
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/mm/internal.h b/mm/internal.h
index ee8af97c39f59..8ade04255dba3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1270,6 +1270,8 @@ extern struct workqueue_struct *mm_percpu_wq;
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
static inline void try_to_unmap_flush(void)
{
@@ -1280,6 +1282,12 @@ static inline void try_to_unmap_flush_dirty(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
+{
+}
+static inline void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/rmap.c b/mm/rmap.c
index 8439dbb194c8c..ac450a45257f6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -641,7 +641,7 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio,
* function, ugen_before(), should be used to evaluate the temporal
* sequence of events because the number is designed to wraparound.
*/
-static atomic_long_t __maybe_unused luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
+static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
@@ -656,6 +656,122 @@ static unsigned long __maybe_unused new_luf_ugen(void)
return ugen;
}
+static void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+ arch_tlbbatch_clear(&batch->arch);
+ batch->flush_required = false;
+ batch->writable = false;
+}
+
+void fold_batch(struct tlbflush_unmap_batch *dst,
+ struct tlbflush_unmap_batch *src, bool reset)
+{
+ if (!src->flush_required)
+ return;
+
+ /*
+ * Fold src to dst.
+ */
+ arch_tlbbatch_fold(&dst->arch, &src->arch);
+ dst->writable = dst->writable || src->writable;
+ dst->flush_required = true;
+
+ if (!reset)
+ return;
+
+ /*
+ * Reset src.
+ */
+ reset_batch(src);
+}
+
+/*
+ * The range that luf_key covers, which is 'unsigned short' type.
+ */
+#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
+
+/*
+ * Use 0th entry as accumulated batch.
+ */
+static struct luf_batch luf_batch[NR_LUF_BATCH];
+
+static void luf_batch_init(struct luf_batch *lb)
+{
+ rwlock_init(&lb->lock);
+ reset_batch(&lb->batch);
+ lb->ugen = atomic_long_read(&luf_ugen) - 1;
+}
+
+static int __init luf_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_LUF_BATCH; i++)
+ luf_batch_init(&luf_batch[i]);
+
+ return 0;
+}
+early_initcall(luf_init);
+
+/*
+ * key to point an entry of the luf_batch array
+ *
+ * note: zero means invalid key
+ */
+static atomic_t luf_kgen = ATOMIC_INIT(1);
+
+/*
+ * Don't return invalid luf_key, zero.
+ */
+static unsigned short __maybe_unused new_luf_key(void)
+{
+ unsigned short luf_key = atomic_inc_return(&luf_kgen);
+
+ if (!luf_key)
+ luf_key = atomic_inc_return(&luf_kgen);
+
+ return luf_key;
+}
+
+static void __fold_luf_batch(struct luf_batch *dst_lb,
+ struct tlbflush_unmap_batch *src_batch,
+ unsigned long src_ugen)
+{
+ /*
+ * dst_lb->ugen represents one that requires tlb shootdown for
+ * it, that is, sort of request number. The newer it is, the
+ * more tlb shootdown might be needed to fulfill the newer
+ * request. Conservatively keep the newer one.
+ */
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ dst_lb->ugen = src_ugen;
+ fold_batch(&dst_lb->batch, src_batch, false);
+}
+
+void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
+{
+ unsigned long flags;
+
+ /*
+ * Exactly same. Nothing to fold.
+ */
+ if (dst == src)
+ return;
+
+ if (&src->lock < &dst->lock) {
+ read_lock_irqsave(&src->lock, flags);
+ write_lock(&dst->lock);
+ } else {
+ write_lock_irqsave(&dst->lock, flags);
+ read_lock(&src->lock);
+ }
+
+ __fold_luf_batch(dst, &src->batch, src->ugen);
+
+ write_unlock(&dst->lock);
+ read_unlock_irqrestore(&src->lock, flags);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -670,9 +786,7 @@ void try_to_unmap_flush(void)
return;
arch_tlbbatch_flush(&tlb_ubc->arch);
- arch_tlbbatch_clear(&tlb_ubc->arch);
- tlb_ubc->flush_required = false;
- tlb_ubc->writable = false;
+ reset_batch(tlb_ubc);
}
/* Flush iff there are potentially writable TLB entries that can race with IO */
--
2.17.1
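The two lock branches in fold_luf_batch() above implement the classic
take-the-lower-address-first rule, so that two CPUs folding the same
pair of entries in opposite directions cannot deadlock. A simplified
userspace model of just that rule (hypothetical names, pthread rwlocks
standing in for the kernel rwlock_t, irq handling and wraparound left
out):

#include <assert.h>
#include <pthread.h>

struct entry {
	pthread_rwlock_t lock;
	unsigned long ugen;
};

static void fold_pair(struct entry *dst, struct entry *src)
{
	if (dst == src)
		return;				/* same entry: nothing to fold */

	/* always take the lower-addressed lock first (ABBA avoidance) */
	if (&src->lock < &dst->lock) {
		pthread_rwlock_rdlock(&src->lock);
		pthread_rwlock_wrlock(&dst->lock);
	} else {
		pthread_rwlock_wrlock(&dst->lock);
		pthread_rwlock_rdlock(&src->lock);
	}

	if (dst->ugen < src->ugen)		/* wraparound ignored in this toy model */
		dst->ugen = src->ugen;		/* keep the newer request generation */

	pthread_rwlock_unlock(&dst->lock);
	pthread_rwlock_unlock(&src->lock);
}

int main(void)
{
	struct entry a = { PTHREAD_RWLOCK_INITIALIZER, 3 };
	struct entry b = { PTHREAD_RWLOCK_INITIALIZER, 7 };

	fold_pair(&a, &b);			/* a adopts the newer generation */
	fold_pair(&b, &a);			/* opposite direction, same lock order */
	assert(a.ugen == 7 && b.ugen == 7);
	return 0;
}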
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (6 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 08/25] mm: introduce luf_batch to be used as hash table to store luf meta data Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable Byungchul Park
` (15 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that performs the required tlb shootdown on exit from the page
allocator.
This patch introduces a new API, rather than reusing the existing
try_to_unmap_flush(), to avoid repeated and redundant tlb shootdowns
caused by frequent page allocations during a session of batched unmap
flush.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/internal.h | 4 ++++
mm/rmap.c | 20 ++++++++++++++++++++
3 files changed, 25 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d6..86ef426644639 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,6 +1401,7 @@ struct task_struct {
#endif
struct tlbflush_unmap_batch tlb_ubc;
+ struct tlbflush_unmap_batch tlb_ubc_takeoff;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/internal.h b/mm/internal.h
index 8ade04255dba3..8ad7e86c1c0e2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1269,6 +1269,7 @@ extern struct workqueue_struct *mm_percpu_wq;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
+void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
@@ -1279,6 +1280,9 @@ static inline void try_to_unmap_flush(void)
static inline void try_to_unmap_flush_dirty(void)
{
}
+static inline void try_to_unmap_flush_takeoff(void)
+{
+}
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
diff --git a/mm/rmap.c b/mm/rmap.c
index ac450a45257f6..61366b4570c9a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -772,6 +772,26 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+void try_to_unmap_flush_takeoff(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+
+ if (!tlb_ubc_takeoff->flush_required)
+ return;
+
+ arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+
+ /*
+ * Now that tlb shootdown of tlb_ubc_takeoff has been performed,
+ * it's good chance to shrink tlb_ubc if possible.
+ */
+ if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc);
+
+ reset_batch(tlb_ubc_takeoff);
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (7 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 09/25] mm: introduce API to perform tlb shootdown on exit from page allocator Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping Byungchul Park
` (14 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that should identify whether tlb shootdown can be performed on page
allocation.
In a context with irqs disabled, or outside of task context, tlb
shootdown cannot be performed because of deadlock issues. Thus, the
page allocator should be aware of whether tlb shootdown can be
performed on the page being returned.
This patch introduces APIs that the pcp or buddy page allocator can use
to delimit the critical sections that take pages off and to identify
whether tlb shootdown can be performed.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 5 ++
mm/internal.h | 14 ++++
mm/page_alloc.c | 159 ++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 2 +-
4 files changed, 179 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 86ef426644639..a3049ea5b3ad3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,11 @@ struct task_struct {
struct callback_head cid_work;
#endif
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ int luf_no_shootdown;
+ int luf_takeoff_started;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
diff --git a/mm/internal.h b/mm/internal.h
index 8ad7e86c1c0e2..bf16482bce2f5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1598,6 +1598,20 @@ static inline void accept_page(struct page *page)
{
}
#endif /* CONFIG_UNACCEPTED_MEMORY */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+extern struct luf_batch luf_batch[];
+bool luf_takeoff_start(void);
+void luf_takeoff_end(void);
+bool luf_takeoff_no_shootdown(void);
+bool luf_takeoff_check(struct page *page);
+bool luf_takeoff_check_and_fold(struct page *page);
+#else
+static inline bool luf_takeoff_start(void) { return false; }
+static inline void luf_takeoff_end(void) {}
+static inline bool luf_takeoff_no_shootdown(void) { return true; }
+static inline bool luf_takeoff_check(struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+#endif
/* pagewalk.c */
int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3930a2a05cd3..f3cb02e36e770 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,6 +622,165 @@ compaction_capture(struct capture_control *capc, struct page *page,
}
#endif /* CONFIG_COMPACTION */
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+static bool no_shootdown_context(void)
+{
+ /*
+ * If it performs with irq disabled, that might cause a deadlock.
+ * Avoid tlb shootdown in this case.
+ */
+ return !(!irqs_disabled() && in_task());
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_start(void)
+{
+ unsigned long flags;
+ bool no_shootdown = no_shootdown_context();
+
+ local_irq_save(flags);
+
+ /*
+ * It's the outmost luf_takeoff_start().
+ */
+ if (!current->luf_takeoff_started)
+ VM_WARN_ON(current->luf_no_shootdown);
+
+ /*
+ * current->luf_no_shootdown > 0 doesn't mean tlb shootdown is
+ * not allowed at all. However, it guarantees tlb shootdown is
+ * possible once current->luf_no_shootdown == 0. It might look
+ * too conservative but for now do it this way for simplicity.
+ */
+ if (no_shootdown || current->luf_no_shootdown)
+ current->luf_no_shootdown++;
+
+ current->luf_takeoff_started++;
+ local_irq_restore(flags);
+
+ return !no_shootdown;
+}
+
+/*
+ * Should be called within the same context of luf_takeoff_start().
+ */
+void luf_takeoff_end(void)
+{
+ unsigned long flags;
+ bool no_shootdown;
+ bool outmost = false;
+
+ local_irq_save(flags);
+ VM_WARN_ON(!current->luf_takeoff_started);
+
+ /*
+ * Assume the context and irq flags are same as those at
+ * luf_takeoff_start().
+ */
+ if (current->luf_no_shootdown)
+ current->luf_no_shootdown--;
+
+ no_shootdown = !!current->luf_no_shootdown;
+
+ current->luf_takeoff_started--;
+
+ /*
+ * It's the outermost luf_takeoff_end().
+ */
+ if (!current->luf_takeoff_started)
+ outmost = true;
+
+ local_irq_restore(flags);
+
+ if (no_shootdown)
+ goto out;
+
+ try_to_unmap_flush_takeoff();
+out:
+ if (outmost)
+ VM_WARN_ON(current->luf_no_shootdown);
+}
+
+/*
+ * Can be called with zone lock released and irq enabled.
+ */
+bool luf_takeoff_no_shootdown(void)
+{
+ bool no_shootdown = true;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ goto out;
+ }
+ no_shootdown = current->luf_no_shootdown;
+out:
+ local_irq_restore(flags);
+ return no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check(struct page *page)
+{
+ unsigned short luf_key = page_luf_key(page);
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ return !current->luf_no_shootdown;
+}
+
+/*
+ * Should be called with either zone lock held and irq disabled or pcp
+ * lock held.
+ */
+bool luf_takeoff_check_and_fold(struct page *page)
+{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * No way. Delimit using luf_takeoff_{start,end}().
+ */
+ if (unlikely(!current->luf_takeoff_started)) {
+ VM_WARN_ON(1);
+ return false;
+ }
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+}
+#endif
+
static inline void account_freepages(struct zone *zone, int nr_pages,
int migratetype)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index 61366b4570c9a..40de03c8f73be 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -693,7 +693,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
/*
* Use 0th entry as accumulated batch.
*/
-static struct luf_batch luf_batch[NR_LUF_BATCH];
+struct luf_batch luf_batch[NR_LUF_BATCH];
static void luf_batch_init(struct luf_batch *lb)
{
--
2.17.1
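To illustrate the nesting rules that luf_takeoff_start() and
luf_takeoff_end() above enforce on current->luf_takeoff_started and
current->luf_no_shootdown, here is a simplified, single-threaded
userspace model (hypothetical names, no irq handling): once any level
of the nest forbids shootdown, every inner level stays forbidden until
that level unwinds.

#include <assert.h>
#include <stdbool.h>

static int takeoff_started;	/* models current->luf_takeoff_started */
static int no_shootdown;	/* models current->luf_no_shootdown */

static void takeoff_start(bool ctx_allows_shootdown)
{
	if (!ctx_allows_shootdown || no_shootdown)
		no_shootdown++;		/* taint this and all inner levels */
	takeoff_started++;
}

/* returns true if a tlb shootdown may be issued at this point */
static bool takeoff_end(void)
{
	bool allowed;

	if (no_shootdown)
		no_shootdown--;
	allowed = !no_shootdown;
	takeoff_started--;
	return allowed;
}

int main(void)
{
	takeoff_start(true);		/* outer window: shootdown allowed     */
	takeoff_start(false);		/* inner window: e.g. irqs disabled    */
	takeoff_start(true);		/* allowed context, but nest is tainted */
	assert(!takeoff_end());		/* still forbidden by the level above  */
	assert(takeoff_end());		/* forbidding level unwinds here       */
	assert(takeoff_end());		/* outermost: shootdown possible again */
	assert(takeoff_started == 0 && no_shootdown == 0);
	return 0;
}

Note that even when the model reports "allowed" at the end of a window
that started forbidden, nothing would actually be flushed in the series
itself, since luf_takeoff_check_and_fold() refuses to fold pages into
tlb_ubc_takeoff while the counter is non-zero.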
^ permalink raw reply [flat|nested] 102+ messages in thread
* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (8 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 10/25] mm: introduce APIs to check if the page allocation is tlb shootdownable Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 12/25] mm: delimit critical sections to take off pages from pcp or buddy alloctor Byungchul Park
` (13 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism
that needs to pass a luf_key to the pcp or buddy allocator when pages
are freed after unmapping, e.g. during page reclaim or page migration.
The luf_key will be used to track, per page residing in pcp or buddy,
the need of tlb shootdown and which cpus need to perform the tlb flush,
and it should be handed over properly when pages travel between pcp and
buddy.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/internal.h | 4 +-
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++-----------
mm/page_frag_cache.c | 6 +--
mm/page_isolation.c | 6 +++
mm/page_reporting.c | 6 +++
mm/slub.c | 2 +-
mm/swap.c | 4 +-
mm/vmscan.c | 8 +--
8 files changed, 111 insertions(+), 41 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index bf16482bce2f5..fe1c879b41487 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -746,8 +746,8 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
nodemask_t *);
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
-void free_frozen_pages(struct page *page, unsigned int order);
-void free_unref_folios(struct folio_batch *fbatch);
+void free_frozen_pages(struct page *page, unsigned int order, unsigned short luf_key);
+void free_unref_folios(struct folio_batch *fbatch, unsigned short luf_key);
#ifdef CONFIG_NUMA
struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3cb02e36e770..986fdd57e8e3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,7 +212,7 @@ unsigned int pageblock_order __read_mostly;
#endif
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags);
+ fpi_t fpi_flags, unsigned short luf_key);
/*
* results with 256, 32 in the lowmem_reserve sysctl:
@@ -850,8 +850,13 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
list_del(&page->buddy_list);
__ClearPageBuddy(page);
- set_page_private(page, 0);
zone->free_area[order].nr_free--;
+
+ /*
+ * Keep head page's private until post_alloc_hook().
+ *
+ * XXX: Tail pages' private doesn't get cleared.
+ */
}
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -920,7 +925,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
- int migratetype, fpi_t fpi_flags)
+ int migratetype, fpi_t fpi_flags, unsigned short luf_key)
{
struct capture_control *capc = task_capc(zone);
unsigned long buddy_pfn = 0;
@@ -937,10 +942,21 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
+ /*
+ * Use the page's luf_key unchanged if luf_key == 0. Worth
+ * noting that page_luf_key() will be 0 in most cases since it's
+ * initialized at free_pages_prepare().
+ */
+ if (luf_key)
+ set_page_luf_key(page, luf_key);
+ else
+ luf_key = page_luf_key(page);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
+ unsigned short buddy_luf_key;
- if (compaction_capture(capc, page, order, migratetype)) {
+ if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -973,6 +989,18 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ /*
+ * !buddy_luf_key && !luf_key : do nothing
+ * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
+ * !buddy_luf_key && luf_key : do nothing
+ * buddy_luf_key && luf_key : merge two into luf_key
+ */
+ buddy_luf_key = page_luf_key(buddy);
+ if (buddy_luf_key && !luf_key)
+ luf_key = buddy_luf_key;
+ else if (buddy_luf_key && luf_key)
+ fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+
if (unlikely(buddy_mt != migratetype)) {
/*
* Match buddy type. This ensures that an
@@ -984,6 +1012,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
+ set_page_luf_key(page, luf_key);
pfn = combined_pfn;
order++;
}
@@ -1242,6 +1271,11 @@ __always_inline bool free_pages_prepare(struct page *page,
VM_BUG_ON_PAGE(PageTail(page), page);
+ /*
+ * Ensure private is zero before using it inside allocator.
+ */
+ set_page_private(page, 0);
+
trace_mm_page_free(page, order);
kmsan_free_page(page, order);
@@ -1407,7 +1441,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
}
@@ -1431,7 +1466,7 @@ static void split_large_buddy(struct zone *zone, struct page *page,
do {
int mt = get_pfnblock_migratetype(page, pfn);
- __free_one_page(page, pfn, zone, order, mt, fpi);
+ __free_one_page(page, pfn, zone, order, mt, fpi, 0);
pfn += 1 << order;
if (pfn == end)
break;
@@ -1441,11 +1476,18 @@ static void split_large_buddy(struct zone *zone, struct page *page,
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long flags;
spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * valid luf_key can be passed only if order == 0.
+ */
+ VM_WARN_ON(luf_key && order);
+ set_page_luf_key(page, luf_key);
+
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1453,13 +1495,13 @@ static void free_one_page(struct zone *zone, struct page *page,
}
static void __free_pages_ok(struct page *page, unsigned int order,
- fpi_t fpi_flags)
+ fpi_t fpi_flags, unsigned short luf_key)
{
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
if (free_pages_prepare(page, order))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ free_one_page(zone, page, pfn, order, fpi_flags, luf_key);
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -1507,7 +1549,7 @@ void __meminit __free_pages_core(struct page *page, unsigned int order,
* Bypass PCP and place fresh pages right to the tail, primarily
* relevant for memory onlining.
*/
- __free_pages_ok(page, order, FPI_TO_TAIL);
+ __free_pages_ok(page, order, FPI_TO_TAIL, 0);
}
/*
@@ -2504,6 +2546,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (unlikely(page == NULL))
break;
+ /*
+ * Keep the page's luf_key.
+ */
+
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -2785,12 +2831,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
static void free_frozen_page_commit(struct zone *zone,
struct per_cpu_pages *pcp, struct page *page, int migratetype,
- unsigned int order)
+ unsigned int order, unsigned short luf_key)
{
int high, batch;
int pindex;
bool free_high = false;
+ set_page_luf_key(page, luf_key);
+
/*
* On freeing, reduce the number of pages that are batch allocated.
* See nr_pcp_alloc() where alloc_factor is increased for subsequent
@@ -2799,7 +2847,16 @@ static void free_frozen_page_commit(struct zone *zone,
pcp->alloc_factor >>= 1;
__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
- list_add(&page->pcp_list, &pcp->lists[pindex]);
+
+ /*
+ * Defer tlb shootdown as much as possible by putting luf'd
+ * pages to the tail.
+ */
+ if (luf_key)
+ list_add_tail(&page->pcp_list, &pcp->lists[pindex]);
+ else
+ list_add(&page->pcp_list, &pcp->lists[pindex]);
+
pcp->count += 1 << order;
batch = READ_ONCE(pcp->batch);
@@ -2834,7 +2891,8 @@ static void free_frozen_page_commit(struct zone *zone,
/*
* Free a pcp page
*/
-void free_frozen_pages(struct page *page, unsigned int order)
+void free_frozen_pages(struct page *page, unsigned int order,
+ unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp;
@@ -2843,7 +2901,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
int migratetype;
if (!pcp_allowed_order(order)) {
- __free_pages_ok(page, order, FPI_NONE);
+ __free_pages_ok(page, order, FPI_NONE, luf_key);
return;
}
@@ -2861,7 +2919,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE, luf_key);
return;
}
migratetype = MIGRATE_MOVABLE;
@@ -2870,10 +2928,10 @@ void free_frozen_pages(struct page *page, unsigned int order)
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_frozen_page_commit(zone, pcp, page, migratetype, order);
+ free_frozen_page_commit(zone, pcp, page, migratetype, order, luf_key);
pcp_spin_unlock(pcp);
} else {
- free_one_page(zone, page, pfn, order, FPI_NONE);
+ free_one_page(zone, page, pfn, order, FPI_NONE, luf_key);
}
pcp_trylock_finish(UP_flags);
}
@@ -2881,7 +2939,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
/*
* Free a batch of folios
*/
-void free_unref_folios(struct folio_batch *folios)
+void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
{
unsigned long __maybe_unused UP_flags;
struct per_cpu_pages *pcp = NULL;
@@ -2902,7 +2960,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (!pcp_allowed_order(order)) {
free_one_page(folio_zone(folio), &folio->page,
- pfn, order, FPI_NONE);
+ pfn, order, FPI_NONE, luf_key);
continue;
}
folio->private = (void *)(unsigned long)order;
@@ -2938,7 +2996,7 @@ void free_unref_folios(struct folio_batch *folios)
*/
if (is_migrate_isolate(migratetype)) {
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
@@ -2951,7 +3009,7 @@ void free_unref_folios(struct folio_batch *folios)
if (unlikely(!pcp)) {
pcp_trylock_finish(UP_flags);
free_one_page(zone, &folio->page, pfn,
- order, FPI_NONE);
+ order, FPI_NONE, luf_key);
continue;
}
locked_zone = zone;
@@ -2966,7 +3024,7 @@ void free_unref_folios(struct folio_batch *folios)
trace_mm_page_free_batched(&folio->page);
free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
- order);
+ order, luf_key);
}
if (pcp) {
@@ -3058,7 +3116,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
/* Return isolated page to tail of freelist. */
__free_one_page(page, page_to_pfn(page), zone, order, mt,
- FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
+ FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL, 0);
}
/*
@@ -4944,7 +5002,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
out:
if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT) && page &&
unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
- free_frozen_pages(page, order);
+ free_frozen_pages(page, order, 0);
page = NULL;
}
@@ -5024,11 +5082,11 @@ void __free_pages(struct page *page, unsigned int order)
int head = PageHead(page);
if (put_page_testzero(page))
- free_frozen_pages(page, order);
+ free_frozen_pages(page, order, 0);
else if (!head) {
pgalloc_tag_sub_pages(page, (1 << order) - 1);
while (order-- > 0)
- free_frozen_pages(page + (1 << order), order);
+ free_frozen_pages(page + (1 << order), order, 0);
}
}
EXPORT_SYMBOL(__free_pages);
@@ -5059,7 +5117,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
last = page + (1UL << order);
for (page += nr; page < last; page++)
- __free_pages_ok(page, 0, FPI_TO_TAIL);
+ __free_pages_ok(page, 0, FPI_TO_TAIL, 0);
}
return (void *)addr;
}
@@ -7077,7 +7135,7 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
- __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+ __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
}
@@ -7146,7 +7204,7 @@ static void __accept_page(struct zone *zone, unsigned long *flags,
accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER);
- __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL);
+ __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL, 0);
if (last)
static_branch_dec(&zones_with_unaccepted_pages);
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e4..558622f15a81e 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -86,7 +86,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
if (page_ref_sub_and_test(page, count))
- free_frozen_pages(page, compound_order(page));
+ free_frozen_pages(page, compound_order(page), 0);
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -139,7 +139,7 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
if (unlikely(encoded_page_decode_pfmemalloc(encoded_page))) {
free_frozen_pages(page,
- encoded_page_decode_order(encoded_page));
+ encoded_page_decode_order(encoded_page), 0);
goto refill;
}
@@ -166,6 +166,6 @@ void page_frag_free(void *addr)
struct page *page = virt_to_head_page(addr);
if (unlikely(put_page_testzero(page)))
- free_frozen_pages(page, compound_order(page));
+ free_frozen_pages(page, compound_order(page), 0);
}
EXPORT_SYMBOL(page_frag_free);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d26..ac45a5f4e7b9f 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -265,6 +265,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
} else {
set_pageblock_migratetype(page, migratetype);
+
+ /*
+ * Do not clear the page's private to keep its luf_key
+ * unchanged.
+ */
+
__putback_isolated_page(page, order, migratetype);
}
zone->nr_isolate_pageblock--;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e61d8c1..c05afb7a395f1 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -116,6 +116,12 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
int mt = get_pageblock_migratetype(page);
unsigned int order = get_order(sg->length);
+ /*
+ * Ensure private is zero before putting into the
+ * allocator.
+ */
+ set_page_private(page, 0);
+
__putback_isolated_page(page, order, mt);
/* If the pages were not reported due to error skip flagging */
diff --git a/mm/slub.c b/mm/slub.c
index 184fd2b147584..812b24ed16ea1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2665,7 +2665,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
__folio_clear_slab(folio);
mm_account_reclaimed_pages(pages);
unaccount_slab(slab, order, s);
- free_frozen_pages(&folio->page, order);
+ free_frozen_pages(&folio->page, order, 0);
}
static void rcu_free_slab(struct rcu_head *h)
diff --git a/mm/swap.c b/mm/swap.c
index 7523b65d8caa6..bdfede631aea9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -109,7 +109,7 @@ void __folio_put(struct folio *folio)
page_cache_release(folio);
folio_unqueue_deferred_split(folio);
mem_cgroup_uncharge(folio);
- free_frozen_pages(&folio->page, folio_order(folio));
+ free_frozen_pages(&folio->page, folio_order(folio), 0);
}
EXPORT_SYMBOL(__folio_put);
@@ -989,7 +989,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
folios->nr = j;
mem_cgroup_uncharge_folios(folios);
- free_unref_folios(folios);
+ free_unref_folios(folios, 0);
}
EXPORT_SYMBOL(folios_put_refs);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fcca38bc640f5..c8a995a3380ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1525,7 +1525,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
}
continue;
@@ -1594,7 +1594,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
@@ -1918,7 +1918,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (folio_batch_add(&free_folios, folio) == 0) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
@@ -1940,7 +1940,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
if (free_folios.nr) {
spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
- free_unref_folios(&free_folios);
+ free_unref_folios(&free_folios, 0);
spin_lock_irq(&lruvec->lru_lock);
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 12/25] mm: delimit critical sections to take off pages from pcp or buddy allocator
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (9 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 11/25] mm: deliver luf_key to pcp or buddy on free after unmapping Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 13/25] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
` (12 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Now that the luf mechanism has been introduced, a tlb shootdown might be
necessary when luf'd pages leave pcp or the buddy allocator. Check
whether pages may be taken off in the current context and, for luf'd
pages, perform the required tlb shootdown before they are used.
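To make the expected pairing explicit, the take-off critical section this
patch wraps around zone->lock users looks roughly like the sketch below.
luf_takeoff_start(), luf_takeoff_check_and_fold() and luf_takeoff_end()
are the helpers used by this series; the wrapper function and its name
are hypothetical and only illustrate the pattern.

static struct page *takeoff_one_page_sketch(struct zone *zone,
					    unsigned int order, int mt)
{
	unsigned long flags;
	struct page *page;

	luf_takeoff_start();
	spin_lock_irqsave(&zone->lock, flags);

	page = get_page_from_free_area(&zone->free_area[order], mt);
	if (page) {
		del_page_from_free_list(page, zone, order, mt);
		/* fold the page's pending tlb shootdown into this context */
		if (unlikely(!luf_takeoff_check_and_fold(page)))
			VM_WARN_ON(1);
	}

	spin_unlock_irqrestore(&zone->lock, flags);

	/* check and flush before the caller uses the page taken off */
	luf_takeoff_end();

	return page;
}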
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 32 ++++++++++++++++--
mm/internal.h | 2 +-
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++--
mm/page_isolation.c | 4 ++-
mm/page_reporting.c | 20 +++++++++++-
5 files changed, 129 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index e5744f354edea..bf5ded83b9dd1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -606,6 +606,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
+ luf_takeoff_start();
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -654,9 +655,12 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
+ if (!luf_takeoff_check(page))
+ goto isolate_fail;
+
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, false);
if (!isolated)
break;
set_page_private(page, order);
@@ -684,6 +688,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (locked)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/*
* Be careful to not go outside of the pageblock.
*/
@@ -1591,6 +1600,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
+ luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
@@ -1598,6 +1608,10 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
+
+ if (!luf_takeoff_check(freepage))
+ goto scan_next;
+
pfn = page_to_pfn(freepage);
if (pfn >= highest)
@@ -1617,7 +1631,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
-
+scan_next:
if (order_scanned >= limit)
break;
}
@@ -1635,7 +1649,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, false)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -1652,6 +1666,11 @@ static void fast_isolate_freepages(struct compact_control *cc)
spin_unlock_irqrestore(&cc->zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
break;
@@ -2373,7 +2392,14 @@ static enum compact_result compact_finished(struct compact_control *cc)
{
int ret;
+ /*
+ * luf_takeoff_{start,end}() is required to identify whether
+ * this compaction context is tlb shootdownable for luf'd pages.
+ */
+ luf_takeoff_start();
ret = __compact_finished(cc);
+ luf_takeoff_end();
+
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index fe1c879b41487..77b7e6d0bcc29 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -666,7 +666,7 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order, bool willputback);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 986fdd57e8e3a..a0182421da13e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -869,8 +869,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
static inline struct page *get_page_from_free_area(struct free_area *area,
int migratetype)
{
- return list_first_entry_or_null(&area->free_list[migratetype],
+ struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ return NULL;
}
/*
@@ -1653,6 +1658,8 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -2023,6 +2030,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
return true;
@@ -2034,6 +2048,13 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, order,
get_pfnblock_migratetype(page, pfn));
+
+ /*
+ * No need to luf_takeoff_check_and_fold() since it's
+ * going back to buddy. luf_key will be handed over in
+ * split_large_buddy().
+ */
+
set_pageblock_migratetype(page, migratetype);
split_large_buddy(zone, page, pfn, order, FPI_NONE);
return true;
@@ -2166,6 +2187,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
account_freepages(zone, nr_added, start_type);
@@ -2246,6 +2269,9 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
+ if (luf_takeoff_no_shootdown())
+ continue;
+
if (can_steal_fallback(order, migratetype))
*can_steal = true;
@@ -2337,6 +2363,11 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
pageblock_nr_pages)
continue;
+ /*
+ * luf_takeoff_{start,end}() is required for
+ * get_page_from_free_area() to use luf_takeoff_check().
+ */
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2394,10 +2425,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
return false;
@@ -2539,6 +2572,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -2563,6 +2597,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
list_add_tail(&page->pcp_list, list);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return i;
}
@@ -3057,7 +3095,7 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -3076,6 +3114,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
del_page_from_free_list(page, zone, order, mt);
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
/*
* Set the pageblock if the isolated page is at least half of a
@@ -3155,6 +3195,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3172,10 +3213,15 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
return NULL;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3259,6 +3305,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
page = list_first_entry(list, struct page, pcp_list);
+ if (!luf_takeoff_check_and_fold(page))
+ return NULL;
list_del(&page->pcp_list);
pcp->count -= 1 << order;
} while (check_new_pages(page, order));
@@ -3276,11 +3324,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
return NULL;
}
@@ -3294,6 +3344,10 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -4892,6 +4946,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
+ luf_takeoff_start();
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -4927,6 +4982,10 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -4936,6 +4995,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
+ luf_takeoff_end();
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -6989,6 +7049,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -7017,9 +7078,15 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
+ if (unlikely(!luf_takeoff_check_and_fold(page)))
+ VM_WARN_ON(1);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return end_pfn - start_pfn - already_offline;
}
@@ -7095,6 +7162,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
@@ -7107,6 +7175,8 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
+ if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
@@ -7117,6 +7187,11 @@ bool take_page_off_buddy(struct page *page)
break;
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
return ret;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ac45a5f4e7b9f..521ed32bdbf67 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -218,6 +218,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
+ luf_takeoff_start();
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
@@ -236,7 +237,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order, true);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
@@ -276,6 +277,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
zone->nr_isolate_pageblock--;
out:
spin_unlock_irqrestore(&zone->lock, flags);
+ luf_takeoff_end();
}
static inline struct page *
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index c05afb7a395f1..03a7f5f6dc073 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -167,6 +167,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (list_empty(list))
return err;
+ luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -191,6 +192,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
+ if (!luf_takeoff_check(page)) {
+ VM_WARN_ON(1);
+ continue;
+ }
+
/*
* If we fully consumed our budget then update our
* state to indicate that we are requesting additional
@@ -204,7 +210,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, false)) {
next = page;
break;
}
@@ -227,6 +233,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -236,6 +247,8 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
+ luf_takeoff_start();
+
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
@@ -259,6 +272,11 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
spin_unlock_irq(&zone->lock);
+ /*
+ * Check and flush before using the pages taken off.
+ */
+ luf_takeoff_end();
+
return err;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 13/25] mm: introduce pend_list in struct free_area to track luf'd pages
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (10 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 12/25] mm: delimit critical sections to take off pages from pcp or buddy allocator Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
` (11 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
luf'd pages require a tlb shootdown when they leave the page allocator.
For some page allocation requests it is okay to return a luf'd page
followed by a tlb shootdown, but it is not okay in e.g. irq context.
This patch splits the list in struct free_area into two: 'free_list'
for non-luf'd pages and 'pend_list' for luf'd pages, so that the buddy
allocator can work better under the various constraints of its calling
contexts.
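The allocation-side preference between the two lists can be summarized
by the simplified sketch below. non_luf_pages_ok(), luf_takeoff_check()
and the pend_list field are introduced by this series; the function
itself only illustrates the policy and is not the exact code in
get_page_from_free_area().

static struct page *pick_free_page_sketch(struct zone *zone,
					  struct free_area *area, int mt)
{
	struct page *page;

	if (non_luf_pages_ok(zone)) {
		/* enough non-luf pages left: keep deferring the tlb flush */
		page = list_first_entry_or_null(&area->free_list[mt],
						struct page, buddy_list);
		if (page)
			return page;
	}

	/* luf'd pages require a tlb shootdown before they can be used */
	page = list_first_entry_or_null(&area->pend_list[mt],
					struct page, buddy_list);
	if (page && luf_takeoff_check(page))
		return page;

	return NULL;
}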
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mmzone.h | 3 ++
kernel/power/snapshot.c | 14 ++++++
kernel/vmcore_info.c | 2 +
mm/compaction.c | 33 ++++++++++---
mm/internal.h | 17 ++++++-
mm/mm_init.c | 2 +
mm/page_alloc.c | 105 ++++++++++++++++++++++++++++++++++------
mm/page_reporting.c | 22 ++++++---
mm/vmstat.c | 15 ++++++
9 files changed, 184 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 550dbba92521a..9294cbbe698fc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ extern int page_group_by_mobility_disabled;
MIGRATETYPE_MASK)
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
+ struct list_head pend_list[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -1014,6 +1015,8 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
+ /* Count pages that need tlb shootdown on allocation */
+ atomic_long_t nr_luf_pages;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index c9fb559a63993..ca10796855aba 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1285,6 +1285,20 @@ static void mark_free_pages(struct zone *zone)
swsusp_set_page_free(pfn_to_page(pfn + i));
}
}
+
+ list_for_each_entry(page,
+ &zone->free_area[order].pend_list[t], buddy_list) {
+ unsigned long i;
+
+ pfn = page_to_pfn(page);
+ for (i = 0; i < (1UL << order); i++) {
+ if (!--page_count) {
+ touch_nmi_watchdog();
+ page_count = WD_PAGE_COUNT;
+ }
+ swsusp_set_page_free(pfn_to_page(pfn + i));
+ }
+ }
}
spin_unlock_irqrestore(&zone->lock, flags);
}
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index 1fec61603ef32..638deb57f9ddd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -188,11 +188,13 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(zone, vm_stat);
VMCOREINFO_OFFSET(zone, spanned_pages);
VMCOREINFO_OFFSET(free_area, free_list);
+ VMCOREINFO_OFFSET(free_area, pend_list);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_LENGTH(zone.free_area, NR_PAGE_ORDERS);
log_buf_vmcoreinfo_setup();
VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.pend_list, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
diff --git a/mm/compaction.c b/mm/compaction.c
index bf5ded83b9dd1..5dfa53252d75b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1592,24 +1592,28 @@ static void fast_isolate_freepages(struct compact_control *cc)
order = next_search_order(cc, order)) {
struct free_area *area = &cc->zone->free_area[order];
struct list_head *freelist;
+ struct list_head *high_pfn_list;
struct page *freepage;
unsigned long flags;
unsigned int order_scanned = 0;
unsigned long high_pfn = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
if (!area->nr_free)
continue;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry_reverse(freepage, freelist, buddy_list) {
unsigned long pfn;
order_scanned++;
nr_scanned++;
- if (!luf_takeoff_check(freepage))
+ if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1622,26 +1626,34 @@ static void fast_isolate_freepages(struct compact_control *cc)
cc->fast_search_fail = 0;
cc->search_order = order;
page = freepage;
- break;
+ goto done;
}
if (pfn >= min_pfn && pfn > high_pfn) {
high_pfn = pfn;
+ high_pfn_list = freelist;
/* Shorten the scan if a candidate is found */
limit >>= 1;
}
scan_next:
if (order_scanned >= limit)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
/* Use a maximum candidate pfn if a preferred one was not found */
if (!page && high_pfn) {
page = pfn_to_page(high_pfn);
/* Update freepage for the list reorder below */
freepage = page;
+ freelist = high_pfn_list;
}
/* Reorder to so a future search skips recent pages */
@@ -2040,18 +2052,20 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
struct list_head *freelist;
unsigned long flags;
struct page *freepage;
+ bool consider_pend = false;
if (!area->nr_free)
continue;
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
+retry:
list_for_each_entry(freepage, freelist, buddy_list) {
unsigned long free_pfn;
if (nr_scanned++ >= limit) {
move_freelist_tail(freelist, freepage);
- break;
+ goto done;
}
free_pfn = page_to_pfn(freepage);
@@ -2074,9 +2088,16 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
- break;
+ goto done;
}
}
+
+ if (!consider_pend) {
+ consider_pend = true;
+ freelist = &area->pend_list[MIGRATE_MOVABLE];
+ goto retry;
+ }
+done:
spin_unlock_irqrestore(&cc->zone->lock, flags);
}
diff --git a/mm/internal.h b/mm/internal.h
index 77b7e6d0bcc29..d34fd43086d89 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -865,11 +865,16 @@ static inline void init_cma_pageblock(struct page *page)
int find_suitable_fallback(struct free_area *area, unsigned int order,
int migratetype, bool only_stealable, bool *can_steal);
-static inline bool free_area_empty(struct free_area *area, int migratetype)
+static inline bool free_list_empty(struct free_area *area, int migratetype)
{
return list_empty(&area->free_list[migratetype]);
}
+static inline bool free_area_empty(struct free_area *area, int migratetype)
+{
+ return list_empty(&area->free_list[migratetype]) &&
+ list_empty(&area->pend_list[migratetype]);
+}
/* mm/util.c */
struct anon_vma *folio_anon_vma(const struct folio *folio);
@@ -1605,12 +1610,22 @@ void luf_takeoff_end(void);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct page *page);
bool luf_takeoff_check_and_fold(struct page *page);
+
+static inline bool non_luf_pages_ok(struct zone *zone)
+{
+ unsigned long nr_free = zone_page_state(zone, NR_FREE_PAGES);
+ unsigned long min_wm = min_wmark_pages(zone);
+ unsigned long nr_luf_pages = atomic_long_read(&zone->nr_luf_pages);
+
+ return nr_free - nr_luf_pages > min_wm;
+}
#else
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
#endif
/* pagewalk.c */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 133640a93d1da..81c5060496112 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1421,12 +1421,14 @@ static void __meminit zone_init_free_lists(struct zone *zone)
unsigned int order, t;
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
zone->free_area[order].nr_free = 0;
}
#ifdef CONFIG_UNACCEPTED_MEMORY
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
+ atomic_long_set(&zone->nr_luf_pages, 0);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a0182421da13e..530c5c16ab323 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -804,15 +804,28 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
bool tail)
{
struct free_area *area = &zone->free_area[order];
+ struct list_head *list;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
+ /*
+ * When identifying whether a page requires tlb shootdown, false
+ * positive is okay because it will cause just additional tlb
+ * shootdown.
+ */
+ if (page_luf_key(page)) {
+ list = &area->pend_list[migratetype];
+ atomic_long_add(1 << order, &zone->nr_luf_pages);
+ } else
+ list = &area->free_list[migratetype];
+
if (tail)
- list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
+ list_add_tail(&page->buddy_list, list);
else
- list_add(&page->buddy_list, &area->free_list[migratetype]);
+ list_add(&page->buddy_list, list);
+
area->nr_free++;
}
@@ -831,7 +844,20 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), old_mt, 1 << order);
- list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+ /*
+ * The page might have been taken from a pfn where it's not
+ * clear which list was used. Therefore, conservatively
+ * consider it as pend_list, not to miss any true ones that
+ * require tlb shootdown.
+ *
+ * When identifying whether a page requires tlb shootdown, false
+ * positive is okay because it will cause just additional tlb
+ * shootdown.
+ */
+ if (page_luf_key(page))
+ list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
+ else
+ list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
account_freepages(zone, 1 << order, new_mt);
@@ -848,6 +874,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
+ if (page_luf_key(page))
+ atomic_long_sub(1 << order, &zone->nr_luf_pages);
+
list_del(&page->buddy_list);
__ClearPageBuddy(page);
zone->free_area[order].nr_free--;
@@ -866,15 +895,48 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
account_freepages(zone, -(1 << order), migratetype);
}
-static inline struct page *get_page_from_free_area(struct free_area *area,
- int migratetype)
+static inline struct page *get_page_from_free_area(struct zone *zone,
+ struct free_area *area, int migratetype)
{
- struct page *page = list_first_entry_or_null(&area->free_list[migratetype],
- struct page, buddy_list);
+ struct page *page;
+ bool pend_first;
- if (page && luf_takeoff_check(page))
- return page;
+ /*
+ * XXX: Make the decision preciser if needed e.g. using
+ * zone_watermark_ok() or its family, but for now, don't want to
+ * make it heavier.
+ *
+ * Try free_list, holding non-luf pages, first if there are
+ * enough non-luf pages to aggressively defer tlb flush, but
+ * should try pend_list first instead if not.
+ */
+ pend_first = !non_luf_pages_ok(zone);
+
+ if (pend_first) {
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ } else {
+ page = list_first_entry_or_null(&area->free_list[migratetype],
+ struct page, buddy_list);
+
+ if (page)
+ return page;
+ page = list_first_entry_or_null(&area->pend_list[migratetype],
+ struct page, buddy_list);
+
+ if (page && luf_takeoff_check(page))
+ return page;
+ }
return NULL;
}
@@ -1027,6 +1089,8 @@ static inline void __free_one_page(struct page *page,
if (fpi_flags & FPI_TO_TAIL)
to_tail = true;
+ else if (page_luf_key(page))
+ to_tail = true;
else if (is_shuffle_order(order))
to_tail = shuffle_pick_tail();
else
@@ -1630,6 +1694,8 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
unsigned int nr_added = 0;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -1643,7 +1709,10 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- __add_to_free_list(&page[size], zone, high, migratetype, false);
+ if (page_luf_key(&page[size]))
+ tail = true;
+
+ __add_to_free_list(&page[size], zone, high, migratetype, tail);
set_buddy_order(&page[size], high);
nr_added += size;
}
@@ -1827,7 +1896,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
area = &(zone->free_area[current_order]);
- page = get_page_from_free_area(area, migratetype);
+ page = get_page_from_free_area(zone, area, migratetype);
if (!page)
continue;
@@ -2269,7 +2338,8 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
if (free_area_empty(area, fallback_mt))
continue;
- if (luf_takeoff_no_shootdown())
+ if (free_list_empty(area, fallback_mt) &&
+ luf_takeoff_no_shootdown())
continue;
if (can_steal_fallback(order, migratetype))
@@ -2373,7 +2443,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
struct free_area *area = &(zone->free_area[order]);
int mt;
- page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+ page = get_page_from_free_area(zone, area, MIGRATE_HIGHATOMIC);
if (!page)
continue;
@@ -2511,7 +2581,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
VM_BUG_ON(current_order > MAX_PAGE_ORDER);
do_steal:
- page = get_page_from_free_area(area, fallback_mt);
+ page = get_page_from_free_area(zone, area, fallback_mt);
/* take off list, maybe claim block, expand remainder */
page = steal_suitable_fallback(zone, page, current_order, order,
@@ -7133,6 +7203,8 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
struct page *current_buddy;
while (high > low) {
+ bool tail = false;
+
high--;
size >>= 1;
@@ -7146,7 +7218,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- add_to_free_list(current_buddy, zone, high, migratetype, false);
+ if (page_luf_key(current_buddy))
+ tail = true;
+
+ add_to_free_list(current_buddy, zone, high, migratetype, tail);
set_buddy_order(current_buddy, high);
}
}
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 03a7f5f6dc073..e152b22fbba8a 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -159,15 +159,17 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
struct page *page, *next;
long budget;
int err = 0;
+ bool consider_pend = false;
+ bool can_shootdown;
/*
* Perform early check, if free area is empty there is
* nothing to process so we can skip this free_list.
*/
- if (list_empty(list))
+ if (free_area_empty(area, mt))
return err;
- luf_takeoff_start();
+ can_shootdown = luf_takeoff_start();
spin_lock_irq(&zone->lock);
/*
@@ -185,14 +187,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
* should always be a power of 2.
*/
budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
+retry:
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
/* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- if (!luf_takeoff_check(page)) {
+ if (unlikely(consider_pend && !luf_takeoff_check(page))) {
VM_WARN_ON(1);
continue;
}
@@ -205,14 +207,14 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
- break;
+ goto done;
}
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order, false)) {
next = page;
- break;
+ goto done;
}
/* Add page to scatter list */
@@ -263,9 +265,15 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* exit on error */
if (err)
- break;
+ goto done;
}
+ if (!consider_pend && can_shootdown) {
+ consider_pend = true;
+ list = &area->pend_list[mt];
+ goto retry;
+ }
+done:
/* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd4..5ae5ac9f0a4a9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1581,6 +1581,21 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
break;
}
}
+ list_for_each(curr, &area->pend_list[mtype]) {
+ /*
+ * Cap the pend_list iteration because it might
+ * be really large and we are under a spinlock
+ * so a long time spent here could trigger a
+ * hard lockup detector. Anyway this is a
+ * debugging tool so knowing there is a handful
+ * of pages of this order should be more than
+ * sufficient.
+ */
+ if (++freecount >= 100000) {
+ overflow = true;
+ break;
+ }
+ }
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
spin_unlock_irq(&zone->lock);
cond_resched();
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (11 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 13/25] mm: introduce pend_list in struct free_area to track luf'd pages Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
` (10 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which needs to recognize read-only tlb entries and handle them
differently. The API newly introduced in this patch, fold_ubc(), will
be used by the luf mechanism.
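For orientation, the intended flow with the separate batch is sketched
below: read-only entries accumulate in tlb_ubc_ro and are merged into
the main batch only when a flush is actually issued. fold_batch() and
the tlb_ubc_ro field come from this series; that the third argument of
fold_batch() also resets the source batch is an assumption here, and
the function is only an illustration of try_to_unmap_flush()'s shape.

static void flush_pending_sketch(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;

	/* merge the read-only batch into the main one (assumed to reset src) */
	fold_batch(tlb_ubc, tlb_ubc_ro, true);

	if (!tlb_ubc->flush_required)
		return;

	arch_tlbbatch_flush(&tlb_ubc->arch);
	/* the real code also resets the batch state after flushing */
}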
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/sched.h | 1 +
mm/rmap.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a3049ea5b3ad3..d1a3c97491ff2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,6 +1407,7 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
+ struct tlbflush_unmap_batch tlb_ubc_ro;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/mm/rmap.c b/mm/rmap.c
index 40de03c8f73be..c9c594d73058c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -775,6 +775,7 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -789,6 +790,9 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc);
+ if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_ro);
+
reset_batch(tlb_ubc_takeoff);
}
@@ -801,7 +805,9 @@ void try_to_unmap_flush_takeoff(void)
void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
if (!tlb_ubc->flush_required)
return;
@@ -813,8 +819,9 @@ void try_to_unmap_flush(void)
void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
- if (tlb_ubc->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable)
try_to_unmap_flush();
}
@@ -831,13 +838,18 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long start, unsigned long end)
{
- struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc;
int batch;
bool writable = pte_dirty(pteval);
if (!pte_accessible(mm, pteval))
return;
+ if (pte_write(pteval))
+ tlb_ubc = &current->tlb_ubc;
+ else
+ tlb_ubc = &current->tlb_ubc_ro;
+
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (12 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 14/25] mm/rmap: recognize read-only tlb entries during batched tlb flush Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped Byungchul Park
` (9 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Functionally, no change. This is a preparation for the luf mechanism,
which needs a hook at page cache updates because the page cache might
contain pages that have been mapped by some tasks, so that any tlb
flush needed can be performed there.
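With every caller funneled through the wrappers, a later patch only
needs to touch one place to flush tlb entries still pending for a
mapping before its data changes. A minimal sketch of that idea,
assuming a hypothetical luf_flush_for_mapping() helper that is not part
of this patch:

static inline int mapping_write_begin(struct file *file,
				      struct address_space *mapping,
				      loff_t pos, unsigned len,
				      struct folio **foliop, void **fsdata)
{
	/*
	 * Hypothetical hook: complete any tlb shootdown luf has kept
	 * pending for pages of this mapping before the file is updated.
	 */
	luf_flush_for_mapping(mapping);

	return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
					   fsdata);
}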
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 11 ++++-------
fs/affs/file.c | 4 ++--
fs/buffer.c | 14 ++++++--------
fs/exfat/file.c | 5 ++---
fs/ext4/verity.c | 5 ++---
fs/f2fs/super.c | 5 ++---
fs/f2fs/verity.c | 5 ++---
fs/namei.c | 5 ++---
include/linux/fs.h | 18 ++++++++++++++++++
mm/filemap.c | 5 ++---
10 files changed, 42 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index ae3343c81a645..22ce009d13689 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -418,7 +418,6 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
const struct drm_i915_gem_pwrite *arg)
{
struct address_space *mapping = obj->base.filp->f_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
char __user *user_data = u64_to_user_ptr(arg->data_ptr);
u64 remain;
loff_t pos;
@@ -477,7 +476,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
if (err)
return err;
- err = aops->write_begin(obj->base.filp, mapping, pos, len,
+ err = mapping_write_begin(obj->base.filp, mapping, pos, len,
&folio, &data);
if (err < 0)
return err;
@@ -488,7 +487,7 @@ shmem_pwrite(struct drm_i915_gem_object *obj,
pagefault_enable();
kunmap_local(vaddr);
- err = aops->write_end(obj->base.filp, mapping, pos, len,
+ err = mapping_write_end(obj->base.filp, mapping, pos, len,
len - unwritten, folio, data);
if (err < 0)
return err;
@@ -654,7 +653,6 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
{
struct drm_i915_gem_object *obj;
struct file *file;
- const struct address_space_operations *aops;
loff_t pos;
int err;
@@ -666,21 +664,20 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
GEM_BUG_ON(obj->write_domain != I915_GEM_DOMAIN_CPU);
file = obj->base.filp;
- aops = file->f_mapping->a_ops;
pos = 0;
do {
unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
struct folio *folio;
void *fsdata;
- err = aops->write_begin(file, file->f_mapping, pos, len,
+ err = mapping_write_begin(file, file->f_mapping, pos, len,
&folio, &fsdata);
if (err < 0)
goto fail;
memcpy_to_folio(folio, offset_in_folio(folio, pos), data, len);
- err = aops->write_end(file, file->f_mapping, pos, len, len,
+ err = mapping_write_end(file, file->f_mapping, pos, len, len,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/fs/affs/file.c b/fs/affs/file.c
index a5a861dd52230..10e7f53828e93 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -885,9 +885,9 @@ affs_truncate(struct inode *inode)
loff_t isize = inode->i_size;
int res;
- res = mapping->a_ops->write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, isize, 0, &folio, &fsdata);
if (!res)
- res = mapping->a_ops->write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, isize, 0, 0, folio, fsdata);
else
inode->i_size = AFFS_I(inode)->mmu_private;
mark_inode_dirty(inode);
diff --git a/fs/buffer.c b/fs/buffer.c
index c66a59bb068b9..6655912f12c46 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2457,7 +2457,6 @@ EXPORT_SYMBOL(block_read_full_folio);
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
struct folio *folio;
void *fsdata = NULL;
int err;
@@ -2466,11 +2465,11 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size)
if (err)
goto out;
- err = aops->write_begin(NULL, mapping, size, 0, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, size, 0, &folio, &fsdata);
if (err)
goto out;
- err = aops->write_end(NULL, mapping, size, 0, 0, folio, fsdata);
+ err = mapping_write_end(NULL, mapping, size, 0, 0, folio, fsdata);
BUG_ON(err > 0);
out:
@@ -2482,7 +2481,6 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
loff_t pos, loff_t *bytes)
{
struct inode *inode = mapping->host;
- const struct address_space_operations *aops = mapping->a_ops;
unsigned int blocksize = i_blocksize(inode);
struct folio *folio;
void *fsdata = NULL;
@@ -2502,12 +2500,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = PAGE_SIZE - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
@@ -2535,12 +2533,12 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
}
len = offset - zerofrom;
- err = aops->write_begin(file, mapping, curpos, len,
+ err = mapping_write_begin(file, mapping, curpos, len,
&folio, &fsdata);
if (err)
goto out;
folio_zero_range(folio, offset_in_folio(folio, curpos), len);
- err = aops->write_end(file, mapping, curpos, len, len,
+ err = mapping_write_end(file, mapping, curpos, len, len,
folio, fsdata);
if (err < 0)
goto out;
diff --git a/fs/exfat/file.c b/fs/exfat/file.c
index 05b51e7217838..9a1002761f79f 100644
--- a/fs/exfat/file.c
+++ b/fs/exfat/file.c
@@ -539,7 +539,6 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
struct inode *inode = file_inode(file);
struct exfat_inode_info *ei = EXFAT_I(inode);
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *ops = mapping->a_ops;
pos = ei->valid_size;
while (pos < new_valid_size) {
@@ -551,14 +550,14 @@ static int exfat_extend_valid_size(struct file *file, loff_t new_valid_size)
if (pos + len > new_valid_size)
len = new_valid_size - pos;
- err = ops->write_begin(file, mapping, pos, len, &folio, NULL);
+ err = mapping_write_begin(file, mapping, pos, len, &folio, NULL);
if (err)
goto out;
off = offset_in_folio(folio, pos);
folio_zero_new_buffers(folio, off, off + len);
- err = ops->write_end(file, mapping, pos, len, len, folio, NULL);
+ err = mapping_write_end(file, mapping, pos, len, len, folio, NULL);
if (err < 0)
goto out;
pos += len;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index d9203228ce979..64fa43f80c73e 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -68,7 +68,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > inode->i_sb->s_maxbytes)
return -EFBIG;
@@ -80,13 +79,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 19b67828ae325..87c26f0571dab 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2710,7 +2710,6 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
{
struct inode *inode = sb_dqopt(sb)->files[type];
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
int offset = off & (sb->s_blocksize - 1);
size_t towrite = len;
struct folio *folio;
@@ -2722,7 +2721,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
tocopy = min_t(unsigned long, sb->s_blocksize - offset,
towrite);
retry:
- err = a_ops->write_begin(NULL, mapping, off, tocopy,
+ err = mapping_write_begin(NULL, mapping, off, tocopy,
&folio, &fsdata);
if (unlikely(err)) {
if (err == -ENOMEM) {
@@ -2735,7 +2734,7 @@ static ssize_t f2fs_quota_write(struct super_block *sb, int type,
memcpy_to_folio(folio, offset_in_folio(folio, off), data, tocopy);
- a_ops->write_end(NULL, mapping, off, tocopy, tocopy,
+ mapping_write_end(NULL, mapping, off, tocopy, tocopy,
folio, fsdata);
offset = 0;
towrite -= tocopy;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 2287f238ae09e..b232589546d39 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -72,7 +72,6 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
loff_t pos)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
if (pos + count > F2FS_BLK_TO_BYTES(max_file_blocks(inode)))
return -EFBIG;
@@ -84,13 +83,13 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
void *fsdata = NULL;
int res;
- res = aops->write_begin(NULL, mapping, pos, n, &folio, &fsdata);
+ res = mapping_write_begin(NULL, mapping, pos, n, &folio, &fsdata);
if (res)
return res;
memcpy_to_folio(folio, offset_in_folio(folio, pos), buf, n);
- res = aops->write_end(NULL, mapping, pos, n, n, folio, fsdata);
+ res = mapping_write_end(NULL, mapping, pos, n, n, folio, fsdata);
if (res < 0)
return res;
if (res != n)
diff --git a/fs/namei.c b/fs/namei.c
index 3ab9440c5b931..e1c6d28c560da 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5409,7 +5409,6 @@ EXPORT_SYMBOL(page_readlink);
int page_symlink(struct inode *inode, const char *symname, int len)
{
struct address_space *mapping = inode->i_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
bool nofs = !mapping_gfp_constraint(mapping, __GFP_FS);
struct folio *folio;
void *fsdata = NULL;
@@ -5419,7 +5418,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
retry:
if (nofs)
flags = memalloc_nofs_save();
- err = aops->write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
+ err = mapping_write_begin(NULL, mapping, 0, len-1, &folio, &fsdata);
if (nofs)
memalloc_nofs_restore(flags);
if (err)
@@ -5427,7 +5426,7 @@ int page_symlink(struct inode *inode, const char *symname, int len)
memcpy(folio_address(folio), symname, len - 1);
- err = aops->write_end(NULL, mapping, 0, len - 1, len - 1,
+ err = mapping_write_end(NULL, mapping, 0, len - 1, len - 1,
folio, fsdata);
if (err < 0)
goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c3b2f8a621f7..820ff4752249e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -531,6 +531,24 @@ struct address_space {
#define PAGECACHE_TAG_WRITEBACK XA_MARK_1
#define PAGECACHE_TAG_TOWRITE XA_MARK_2
+static inline int mapping_write_begin(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len,
+ struct folio **foliop, void **fsdata)
+{
+ return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ fsdata);
+}
+
+static inline int mapping_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct folio *folio, void *fsdata)
+{
+ return mapping->a_ops->write_end(file, mapping, pos, len, copied,
+ folio, fsdata);
+}
+
/*
* Returns true if any of the pages in the mapping are marked with the tag.
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index c6650de837d06..1c6fda5a43020 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4141,7 +4141,6 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
struct file *file = iocb->ki_filp;
loff_t pos = iocb->ki_pos;
struct address_space *mapping = file->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
size_t chunk = mapping_max_folio_size(mapping);
long status = 0;
ssize_t written = 0;
@@ -4175,7 +4174,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
break;
}
- status = a_ops->write_begin(file, mapping, pos, bytes,
+ status = mapping_write_begin(file, mapping, pos, bytes,
&folio, &fsdata);
if (unlikely(status < 0))
break;
@@ -4190,7 +4189,7 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
flush_dcache_folio(folio);
- status = a_ops->write_end(file, mapping, pos, bytes, copied,
+ status = mapping_write_end(file, mapping, pos, bytes, copied,
folio, fsdata);
if (unlikely(status != copied)) {
iov_iter_revert(i, copied - max(status, 0L));
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (13 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 15/25] fs, filemap: refactor to gather the scattered ->write_{begin,end}() calls Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
` (8 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed, eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, as
long as the contents of the folios don't change while staying in pcp or
buddy so we can still read the data through the stale tlb entries.
tlb flush can be deferred when folios get unmapped, as long as the
deferred flush is guaranteed to be performed before the folios actually
get reused, and only if none of the corresponding ptes have write
permission. Otherwise, the system would end up corrupted.
To achieve that, for the folios that map only to non-writable tlb
entries, suppress the tlb flush during unmapping and perform it just
before the folios actually get reused, out of buddy or pcp.
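For illustration only, a minimal conceptual sketch of the flow follows;
the helper names here are hypothetical and do not exist in the tree,
the real implementation is in the diff below:

	/* unmap path: defer the flush for folios mapped read-only everywhere */
	if (folio_mapped_read_only(folio)) {		/* hypothetical check */
		record_pending_tlb_flush(folio);	/* hypothetical: remember the cpus */
		free_folio_without_flush(folio);	/* the folio goes to pcp or buddy */
	} else {
		flush_tlb_and_free(folio);		/* the usual path */
	}

	/* alloc path: the deferred flush must happen before the page is reused */
	page = take_page_from_pcp_or_buddy();		/* hypothetical */
	if (page_has_pending_luf_flush(page))		/* hypothetical */
		perform_deferred_tlb_flush(page);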
However, the flush pended by LUF should be canceled, and the deferred
TLB flush performed right away, when:
1. a writable pte is newly set through fault handler
2. a file is updated
3. kasan needs poisoning on free
4. the kernel wants to init pages on free
No matter what type of workload is used for performance evaluation,
the result should be positive thanks to the unconditional reduction of
tlb flushes, tlb misses and interrupts. For the test, I picked one of
the most popular and heavy workloads, llama.cpp, an LLM(Large Language
Model) inference engine.
The result depends on memory latency and how often reclaim runs, which
determine the tlb miss overhead and how many times unmapping happens.
On my system, the result shows:
1. tlb shootdown interrupts are reduced about 97%.
2. The test program runtime is reduced about 4.5%.
The test environment and the test set are like:
Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
CPU: 1 socket 64 core with hyper thread on
Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
Config: swap off, numa balancing tiering on, demotion enabled
llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
wait
where,
-t: nr of threads, -s: seed used to make the runtime stable,
-n: nr of tokens that determines the runtime, -p: prompt to ask,
-m: LLM model to use.
The test set was run 5 times successively, with caches dropped before
every run via 'echo 3 > /proc/sys/vm/drop_caches'. Each inference
prints its runtime when it finishes. The results are:
1. Runtime from the output of llama.cpp
BEFORE
------
llama_print_timings: total time = 883450.54 ms / 24 tokens
llama_print_timings: total time = 861665.91 ms / 24 tokens
llama_print_timings: total time = 898079.02 ms / 24 tokens
llama_print_timings: total time = 879897.69 ms / 24 tokens
llama_print_timings: total time = 892360.75 ms / 24 tokens
llama_print_timings: total time = 884587.85 ms / 24 tokens
llama_print_timings: total time = 861023.19 ms / 24 tokens
llama_print_timings: total time = 900022.18 ms / 24 tokens
llama_print_timings: total time = 878771.88 ms / 24 tokens
llama_print_timings: total time = 889027.98 ms / 24 tokens
llama_print_timings: total time = 880783.90 ms / 24 tokens
llama_print_timings: total time = 856475.29 ms / 24 tokens
llama_print_timings: total time = 896842.21 ms / 24 tokens
llama_print_timings: total time = 878883.53 ms / 24 tokens
llama_print_timings: total time = 890122.10 ms / 24 tokens
AFTER
-----
llama_print_timings: total time = 871060.86 ms / 24 tokens
llama_print_timings: total time = 825609.53 ms / 24 tokens
llama_print_timings: total time = 836854.81 ms / 24 tokens
llama_print_timings: total time = 843147.99 ms / 24 tokens
llama_print_timings: total time = 831426.65 ms / 24 tokens
llama_print_timings: total time = 873939.23 ms / 24 tokens
llama_print_timings: total time = 826127.69 ms / 24 tokens
llama_print_timings: total time = 835489.26 ms / 24 tokens
llama_print_timings: total time = 842589.62 ms / 24 tokens
llama_print_timings: total time = 833700.66 ms / 24 tokens
llama_print_timings: total time = 875996.19 ms / 24 tokens
llama_print_timings: total time = 826401.73 ms / 24 tokens
llama_print_timings: total time = 839341.28 ms / 24 tokens
llama_print_timings: total time = 841075.10 ms / 24 tokens
llama_print_timings: total time = 835136.41 ms / 24 tokens
2. tlb shootdowns from 'cat /proc/interrupts'
BEFORE
------
TLB:
80911532 93691786 100296251 111062810 109769109 109862429
108968588 119175230 115779676 118377498 119325266 120300143
124514185 116697222 121068466 118031913 122660681 117494403
121819907 116960596 120936335 117217061 118630217 122322724
119595577 111693298 119232201 120030377 115334687 113179982
118808254 116353592 140987367 137095516 131724276 139742240
136501150 130428761 127585535 132483981 133430250 133756207
131786710 126365824 129812539 133850040 131742690 125142213
128572830 132234350 131945922 128417707 133355434 129972846
126331823 134050849 133991626 121129038 124637283 132830916
126875507 122322440 125776487 124340278 TLB shootdowns
AFTER
-----
TLB:
2121206 2615108 2983494 2911950 3055086 3092672
3204894 3346082 3286744 3307310 3357296 3315940
3428034 3112596 3143325 3185551 3186493 3322314
3330523 3339663 3156064 3272070 3296309 3198962
3332662 3315870 3234467 3353240 3281234 3300666
3345452 3173097 4009196 3932215 3898735 3726531
3717982 3671726 3728788 3724613 3799147 3691764
3620630 3684655 3666688 3393974 3448651 3487593
3446357 3618418 3671920 3712949 3575264 3715385
3641513 3630897 3691047 3630690 3504933 3662647
3629926 3443044 3832970 3548813 TLB shootdowns
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 5 ++
include/linux/fs.h | 12 +++-
include/linux/mm_types.h | 6 ++
include/linux/sched.h | 9 +++
kernel/sched/core.c | 1 +
mm/internal.h | 94 ++++++++++++++++++++++++-
mm/memory.c | 15 ++++
mm/pgtable-generic.c | 2 +
mm/rmap.c | 141 +++++++++++++++++++++++++++++++++++---
mm/truncate.c | 55 +++++++++++++--
mm/vmscan.c | 12 +++-
11 files changed, 333 insertions(+), 19 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b35b36fa7aabf..4b7d29d8ea794 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -567,6 +567,11 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
+ /*
+ * Don't leave stale tlb entries for this vma.
+ */
+ luf_flush(0);
+
if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS))
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 820ff4752249e..78aaf769d32d1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -536,8 +536,18 @@ static inline int mapping_write_begin(struct file *file,
loff_t pos, unsigned len,
struct folio **foliop, void **fsdata)
{
- return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+ int ret;
+
+ ret = mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
fsdata);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ if (!ret)
+ luf_flush(0);
+
+ return ret;
}
static inline int mapping_write_end(struct file *file,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f52d4e49e8736..117f8e822e969 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1353,6 +1353,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+void luf_flush(unsigned short luf_key);
+#else
+static inline void luf_flush(unsigned short luf_key) {}
+#endif
+
struct vm_fault;
/**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d1a3c97491ff2..47a0a3ccb7b1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1408,6 +1408,15 @@ struct task_struct {
struct tlbflush_unmap_batch tlb_ubc;
struct tlbflush_unmap_batch tlb_ubc_takeoff;
struct tlbflush_unmap_batch tlb_ubc_ro;
+ struct tlbflush_unmap_batch tlb_ubc_luf;
+
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+ /*
+ * whether all the mappings of a folio during unmap are read-only
+ * so that luf can work on the folio
+ */
+ bool can_luf;
+#endif
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac691..1f4c5da800365 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5275,6 +5275,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
+ luf_flush(0);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/internal.h b/mm/internal.h
index d34fd43086d89..2429db598e265 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1619,13 +1619,105 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-#else
+
+unsigned short fold_unmap_luf(void);
+
+/*
+ * Reset, at the beginning of every rmap traverse for unmap, the
+ * indicator that tracks whether there are no writable mappings. luf
+ * can work only when all the mappings are read-only.
+ */
+static inline void can_luf_init(struct folio *f)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (want_init_on_free())
+ current->can_luf = false;
+ /*
+ * Pages might get updated inside buddy.
+ */
+ else if (!should_skip_kasan_poison(folio_page(f, 0)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles zone device folio.
+ */
+ else if (unlikely(folio_is_zone_device(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles hugetlb folio.
+ */
+ else if (unlikely(folio_test_hugetlb(f)))
+ current->can_luf = false;
+ /*
+ * XXX: Remove the constraint once luf handles large folio.
+ */
+ else if (unlikely(folio_test_large(f)))
+ current->can_luf = false;
+ /*
+ * Can track write of anon folios through fault handler.
+ */
+ else if (folio_test_anon(f))
+ current->can_luf = true;
+ /*
+ * Can track write of file folios through page cache or truncation.
+ */
+ else if (folio_mapping(f))
+ current->can_luf = true;
+ /*
+ * For folios that are neither anon nor file-backed, do not apply luf.
+ */
+ else
+ current->can_luf = false;
+}
+
+/*
+ * Mark the folio as not applicable to luf once a writable or dirty
+ * pte has been found during rmap traverse for unmap.
+ */
+static inline void can_luf_fail(void)
+{
+ current->can_luf = false;
+}
+
+/*
+ * Check if all the mappings are read-only.
+ */
+static inline bool can_luf_test(void)
+{
+ return current->can_luf;
+}
+
+static inline bool can_luf_vma(struct vm_area_struct *vma)
+{
+ /*
+ * Shared region requires a medium like file to keep all the
+ * associated mm_struct. luf makes use of struct address_space
+ * for that purpose.
+ */
+ if (vma->vm_flags & VM_SHARED)
+ return !!vma->vm_file;
+
+ /*
+ * Private region can be handled through its mm_struct.
+ */
+ return true;
+}
+#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
static inline void luf_takeoff_end(void) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct page *page) { return true; }
static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
+static inline unsigned short fold_unmap_luf(void) { return 0; }
+
+static inline void can_luf_init(struct folio *f) {}
+static inline void can_luf_fail(void) {}
+static inline bool can_luf_test(void) { return false; }
+static inline bool can_luf_vma(struct vm_area_struct *vma) { return false; }
#endif
/* pagewalk.c */
diff --git a/mm/memory.c b/mm/memory.c
index cacf6d53bdf32..e496d8deb887f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6216,6 +6216,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6241,6 +6242,14 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
lru_gen_enter_fault(vma);
+ /*
+ * Any potential case that makes a pte writable, even forcibly,
+ * should be considered.
+ */
+ if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
+ flags & FAULT_FLAG_WRITE)
+ flush = true;
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6272,6 +6281,12 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
out:
mm_account_fault(mm, regs, address, flags, ret);
+ /*
+ * Ensure to clean stale tlb entries for this vma.
+ */
+ if (flush)
+ luf_flush(0);
+
return ret;
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f90..d6678d6bac746 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -99,6 +99,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
pte = ptep_get_and_clear(mm, address, ptep);
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
+ else
+ luf_flush(0);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index c9c594d73058c..2191cf1d38270 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -646,7 +646,7 @@ static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
/*
* Don't return invalid luf_ugen, zero.
*/
-static unsigned long __maybe_unused new_luf_ugen(void)
+static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
@@ -723,7 +723,7 @@ static atomic_t luf_kgen = ATOMIC_INIT(1);
/*
* Don't return invalid luf_key, zero.
*/
-static unsigned short __maybe_unused new_luf_key(void)
+static unsigned short new_luf_key(void)
{
unsigned short luf_key = atomic_inc_return(&luf_kgen);
@@ -776,6 +776,7 @@ void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
if (!tlb_ubc_takeoff->flush_required)
@@ -793,9 +794,72 @@ void try_to_unmap_flush_takeoff(void)
if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
reset_batch(tlb_ubc_ro);
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc_takeoff->arch))
+ reset_batch(tlb_ubc_luf);
+
reset_batch(tlb_ubc_takeoff);
}
+/*
+ * Should be called just before try_to_unmap_flush() to optimize the tlb
+ * shootdown using arch_tlbbatch_done().
+ */
+unsigned short fold_unmap_luf(void)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+ struct luf_batch *lb;
+ unsigned long new_ugen;
+ unsigned short new_key;
+ unsigned long flags;
+
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ /*
+ * fold_unmap_luf() is always followed by try_to_unmap_flush().
+ */
+ if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc->arch)) {
+ tlb_ubc_luf->flush_required = false;
+ tlb_ubc_luf->writable = false;
+ }
+
+ /*
+ * Check again after shrinking.
+ */
+ if (!tlb_ubc_luf->flush_required)
+ return 0;
+
+ new_ugen = new_luf_ugen();
+ new_key = new_luf_key();
+
+ /*
+ * Update the next entry of the luf_batch table, which is the oldest
+ * entry among the candidates, hoping that its tlb flushes have
+ * already been done for all of the CPUs.
+ */
+ lb = &luf_batch[new_key];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc_luf, new_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+
+ reset_batch(tlb_ubc_luf);
+ return new_key;
+}
+
+void luf_flush(unsigned short luf_key)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct luf_batch *lb = &luf_batch[luf_key];
+ unsigned long flags;
+
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ read_unlock_irqrestore(&lb->lock, flags);
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -806,8 +870,10 @@ void try_to_unmap_flush(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
@@ -820,8 +886,9 @@ void try_to_unmap_flush_dirty(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
- if (tlb_ubc->writable || tlb_ubc_ro->writable)
+ if (tlb_ubc->writable || tlb_ubc_ro->writable || tlb_ubc_luf->writable)
try_to_unmap_flush();
}
@@ -836,7 +903,8 @@ void try_to_unmap_flush_dirty(void)
(TLB_FLUSH_BATCH_PENDING_MASK / 2)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ struct vm_area_struct *vma)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -845,7 +913,16 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!pte_accessible(mm, pteval))
return;
- if (pte_write(pteval))
+ if (can_luf_test()) {
+ /*
+ * luf cannot work with the folio once a writable or
+ * dirty mapping has been found on it.
+ */
+ if (pte_write(pteval) || !can_luf_vma(vma))
+ can_luf_fail();
+ }
+
+ if (!can_luf_test())
tlb_ubc = &current->tlb_ubc;
else
tlb_ubc = &current->tlb_ubc_ro;
@@ -853,6 +930,21 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
tlb_ubc->flush_required = true;
+ if (can_luf_test()) {
+ struct luf_batch *lb;
+ unsigned long flags;
+
+ /*
+ * Accumulate to the 0th entry right away so that
+ * luf_flush(0) can be used to properly perform pending
+ * TLB flush once this unmapping is observed.
+ */
+ lb = &luf_batch[0];
+ write_lock_irqsave(&lb->lock, flags);
+ __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
+ write_unlock_irqrestore(&lb->lock, flags);
+ }
+
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -907,6 +999,8 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
* This must be called under the PTL so that an access to tlb_flush_batched
* that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
* via the PTL.
+ *
+ * LUF(Lazy Unmap Flush) also relies on this for mprotect/munmap/etc.
*/
void flush_tlb_batched_pending(struct mm_struct *mm)
{
@@ -916,6 +1010,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
if (pending != flushed) {
arch_flush_tlb_batched_pending(mm);
+
/*
* If the new TLB flushing is pending during flushing, leave
* mm->tlb_flush_batched as is, to avoid losing flushing.
@@ -926,7 +1021,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
}
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ struct vm_area_struct *vma)
{
}
@@ -1300,6 +1396,11 @@ int folio_mkclean(struct folio *folio)
rmap_walk(folio, &rwc);
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
return cleaned;
}
EXPORT_SYMBOL_GPL(folio_mkclean);
@@ -2146,7 +2247,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* and traps if the PTE is unmapped.
*/
if (should_defer_flush(mm, flags))
- set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
+ set_tlb_ubc_flush_pending(mm, pteval, address, end_addr, vma);
else
flush_tlb_range(vma, address, end_addr);
if (pte_dirty(pteval))
@@ -2329,6 +2430,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2361,11 +2464,21 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+
+ can_luf_init(folio);
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
/*
@@ -2533,7 +2646,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
+ set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE, vma);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
@@ -2669,6 +2782,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_end(&range);
+ if (!ret)
+ can_luf_fail();
return ret;
}
@@ -2688,6 +2803,9 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+ struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
/*
* Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
@@ -2712,10 +2830,17 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
if (!folio_test_ksm(folio) && folio_test_anon(folio))
rwc.invalid_vma = invalid_migration_vma;
+ can_luf_init(folio);
+
if (flags & TTU_RMAP_LOCKED)
rmap_walk_locked(folio, &rwc);
else
rmap_walk(folio, &rwc);
+
+ if (can_luf_test())
+ fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+ else
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/truncate.c b/mm/truncate.c
index 031d0be19f42c..68c9ded2f789b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -138,6 +138,11 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
if (aops->invalidate_folio)
aops->invalidate_folio(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -174,6 +179,11 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
truncate_cleanup_folio(folio);
filemap_remove_folio(folio);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return 0;
}
@@ -220,6 +230,12 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
if (folio_needs_release(folio))
folio_invalidate(folio, offset, length);
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
if (!folio_test_large(folio))
return true;
@@ -289,19 +305,28 @@ EXPORT_SYMBOL(generic_error_remove_folio);
*/
long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
/* The page may have been truncated before it was locked */
if (!mapping)
- return 0;
+ goto out;
if (folio_test_dirty(folio) || folio_test_writeback(folio))
- return 0;
+ goto out;
/* The refcount will be elevated if any page in the folio is mapped */
if (folio_ref_count(folio) >
folio_nr_pages(folio) + folio_has_private(folio) + 1)
- return 0;
+ goto out;
if (!filemap_release_folio(folio, 0))
- return 0;
+ goto out;
- return remove_mapping(mapping, folio);
+ ret = remove_mapping(mapping, folio);
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
@@ -341,7 +366,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
bool same_folio;
if (mapping_empty(mapping))
- return;
+ goto out;
/*
* 'start' and 'end' always covers the range of pages to be fully
@@ -429,6 +454,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
folio_batch_release(&fbatch);
}
+
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -544,6 +575,11 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
folio_batch_release(&fbatch);
cond_resched();
}
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return count;
}
@@ -648,7 +684,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
int did_range_unmap = 0;
if (mapping_empty(mapping))
- return 0;
+ goto out;
folio_batch_init(&fbatch);
index = start;
@@ -709,6 +745,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
if (dax_mapping(mapping)) {
unmap_mapping_pages(mapping, start, end - start + 1, false);
}
+out:
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c8a995a3380ac..422b9a03a6753 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -838,6 +838,8 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
*/
long remove_mapping(struct address_space *mapping, struct folio *folio)
{
+ long ret = 0;
+
if (__remove_mapping(mapping, folio, false, NULL)) {
/*
* Unfreezing the refcount with 1 effectively
@@ -845,9 +847,15 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
* atomic operation.
*/
folio_ref_unfreeze(folio, 1);
- return folio_nr_pages(folio);
+ ret = folio_nr_pages(folio);
}
- return 0;
+
+ /*
+ * Ensure to clean stale tlb entries for this mapping.
+ */
+ luf_flush(0);
+
+ return ret;
}
/**
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (14 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 16/25] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
` (7 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
The luf mechanism performs tlb shootdown for mappings that have been
unmapped in a lazy manner. However, it doesn't have to include the
cpus whose tlb has already been flushed by others since the shootdown
was requested.
Since luf already introduced its own generation number, luf_ugen, used
as a global timestamp, it's possible to selectively pick the cpus that
have already performed the tlb flush required.
This patch introduces APIs that use the generation number to select and
remove those cpus, so that tlb shootdown can be performed with a
smaller cpumask, for all the CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
archs: x86, riscv, and arm64.
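In short, each cpu records, in a per-cpu variable ugen_done, the
generation of the last tlb flush it performed, and a pending shootdown
can drop the cpus that have already caught up. Condensed from the
arch_tlbbatch_diet() added below:

	for_each_cpu(cpu, &batch->cpumask) {
		unsigned long done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));

		/* this cpu already flushed past the requested generation */
		if (!ugen_before(done, ugen))
			cpumask_clear_cpu(cpu, &batch->cpumask);
	}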
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/arm64/include/asm/tlbflush.h | 26 +++++++
arch/riscv/include/asm/tlbflush.h | 4 ++
arch/riscv/mm/tlbflush.c | 108 ++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 4 ++
arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++
include/linux/sched.h | 1 +
mm/internal.h | 4 ++
mm/page_alloc.c | 32 +++++++--
mm/rmap.c | 46 ++++++++++++-
9 files changed, 327 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index f7036cd33e35c..ae3c981fcc218 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -347,6 +347,32 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
dsb(ish);
}
+static inline bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /*
+ * Nothing is needed in this architecture.
+ */
+ return true;
+}
+
+static inline void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
+static inline void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ /* nothing to do */
+}
+
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
/* nothing to do */
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index cecd8e7e2a3bd..936bf9ce0abd9 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -64,6 +64,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm, unsigned long start, unsigned long end);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 38f4bea8a964a..6ce44370a8e11 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -201,3 +201,111 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
__flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
}
+
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * batch will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy, but the race only results in an unnecessary tlb
+ * flush because ugen_done ends up smaller than it should be.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy, but the race only results in an unnecessary tlb
+ * flush because ugen_done ends up smaller than it should be.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 52c54ca68ca9e..58ad7e6989bb1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -293,6 +293,10 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
+extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 523e8bb6fba1f..be6068b60c32d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1270,6 +1270,114 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+static DEFINE_PER_CPU(atomic_long_t, ugen_done);
+
+static int __init luf_init_arch(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ atomic_long_set(per_cpu_ptr(&ugen_done, cpu), LUF_UGEN_INIT - 1);
+
+ return 0;
+}
+early_initcall(luf_init_arch);
+
+/*
+ * batch will not be updated.
+ */
+bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (ugen_before(done, ugen))
+ return false;
+ }
+ return true;
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ goto out;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ unsigned long done;
+
+ done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
+ if (!ugen_before(done, ugen))
+ cpumask_clear_cpu(cpu, &batch->cpumask);
+ }
+out:
+ return cpumask_empty(&batch->cpumask);
+}
+
+void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
+ unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, &batch->cpumask) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy, but the race only results in an unnecessary tlb
+ * flush because ugen_done ends up smaller than it should be.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
+void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
+{
+ int cpu;
+
+ if (!ugen)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
+ unsigned long old = atomic_long_read(done);
+
+ /*
+ * It's racy, but the race only results in an unnecessary tlb
+ * flush because ugen_done ends up smaller than it should be.
+ * It's okay in terms of correctness.
+ */
+ if (!ugen_before(old, ugen))
+ continue;
+
+ /*
+ * It's just an optimization. Skip on failure rather than retry.
+ */
+ atomic_long_cmpxchg(done, old, ugen);
+ }
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 47a0a3ccb7b1a..31efc88ce911a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,7 @@ struct task_struct {
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
int luf_no_shootdown;
int luf_takeoff_started;
+ unsigned long luf_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/internal.h b/mm/internal.h
index 2429db598e265..9fccfd38e03f0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1276,6 +1276,7 @@ void try_to_unmap_flush(void);
void try_to_unmap_flush_dirty(void);
void try_to_unmap_flush_takeoff(void);
void flush_tlb_batched_pending(struct mm_struct *mm);
+void reset_batch(struct tlbflush_unmap_batch *batch);
void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset);
void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src);
#else
@@ -1291,6 +1292,9 @@ static inline void try_to_unmap_flush_takeoff(void)
static inline void flush_tlb_batched_pending(struct mm_struct *mm)
{
}
+static inline void reset_batch(struct tlbflush_unmap_batch *batch)
+{
+}
static inline void fold_batch(struct tlbflush_unmap_batch *dst, struct tlbflush_unmap_batch *src, bool reset)
{
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 530c5c16ab323..7b023b34d53da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -668,9 +668,11 @@ bool luf_takeoff_start(void)
*/
void luf_takeoff_end(void)
{
+ struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
+ unsigned long cur_luf_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -697,10 +699,19 @@ void luf_takeoff_end(void)
if (no_shootdown)
goto out;
+ cur_luf_ugen = current->luf_ugen;
+
+ current->luf_ugen = 0;
+
+ if (cur_luf_ugen && arch_tlbbatch_diet(&tlb_ubc_takeoff->arch, cur_luf_ugen))
+ reset_batch(tlb_ubc_takeoff);
+
try_to_unmap_flush_takeoff();
out:
- if (outmost)
+ if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
+ VM_WARN_ON(current->luf_ugen);
+ }
}
/*
@@ -757,6 +768,7 @@ bool luf_takeoff_check_and_fold(struct page *page)
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned short luf_key = page_luf_key(page);
struct luf_batch *lb;
+ unsigned long lb_ugen;
unsigned long flags;
/*
@@ -770,13 +782,25 @@ bool luf_takeoff_check_and_fold(struct page *page)
if (!luf_key)
return true;
- if (current->luf_no_shootdown)
- return false;
-
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
+
fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 2191cf1d38270..579c75f46c170 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -656,7 +656,7 @@ static unsigned long new_luf_ugen(void)
return ugen;
}
-static void reset_batch(struct tlbflush_unmap_batch *batch)
+void reset_batch(struct tlbflush_unmap_batch *batch)
{
arch_tlbbatch_clear(&batch->arch);
batch->flush_required = false;
@@ -743,8 +743,14 @@ static void __fold_luf_batch(struct luf_batch *dst_lb,
* more tlb shootdown might be needed to fulfill the newer
* request. Conservatively keep the newer one.
*/
- if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen))
+ if (!dst_lb->ugen || ugen_before(dst_lb->ugen, src_ugen)) {
+ /*
+ * Good chance to shrink the batch using the old ugen.
+ */
+ if (dst_lb->ugen && arch_tlbbatch_diet(&dst_lb->batch.arch, dst_lb->ugen))
+ reset_batch(&dst_lb->batch);
dst_lb->ugen = src_ugen;
+ }
fold_batch(&dst_lb->batch, src_batch, false);
}
@@ -772,17 +778,45 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static unsigned long tlb_flush_start(void)
+{
+ /*
+ * Memory barrier implied in the atomic operation prevents
+ * reading luf_ugen from happening after the following
+ * tlb flush.
+ */
+ return new_luf_ugen();
+}
+
+static void tlb_flush_end(struct arch_tlbflush_unmap_batch *arch,
+ struct mm_struct *mm, unsigned long ugen)
+{
+ /*
+ * Prevent the following marking from placing prior to the
+ * actual tlb flush.
+ */
+ smp_mb();
+
+ if (arch)
+ arch_tlbbatch_mark_ugen(arch, ugen);
+ if (mm)
+ arch_mm_mark_ugen(mm, ugen);
+}
+
void try_to_unmap_flush_takeoff(void)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
+ unsigned long ugen;
if (!tlb_ubc_takeoff->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc_takeoff->arch);
+ tlb_flush_end(&tlb_ubc_takeoff->arch, NULL, ugen);
/*
* Now that tlb shootdown of tlb_ubc_takeoff has been performed,
@@ -871,13 +905,17 @@ void try_to_unmap_flush(void)
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+ unsigned long ugen;
fold_batch(tlb_ubc, tlb_ubc_ro, true);
fold_batch(tlb_ubc, tlb_ubc_luf, true);
if (!tlb_ubc->flush_required)
return;
+ ugen = tlb_flush_start();
arch_tlbbatch_flush(&tlb_ubc->arch);
+ tlb_flush_end(&tlb_ubc->arch, NULL, ugen);
+
reset_batch(tlb_ubc);
}
@@ -1009,7 +1047,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
if (pending != flushed) {
+ unsigned long ugen;
+
+ ugen = tlb_flush_start();
arch_flush_tlb_batched_pending(mm);
+ tlb_flush_end(NULL, mm, ugen);
/*
* If the new TLB flushing is pending during flushing, leave
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (15 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 17/25] x86/tlb, riscv/tlb, arm64/tlbflush, mm: remove cpus from tlb shootdown that already have been done Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 19/25] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
` (6 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
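In short, rather than giving up as soon as the first page in the pcp
list fails the luf check, walk up to 3 pages in the list before
falling back. A condensed view of the change below:

	int try_luf_pages = 3;

	list_for_each_entry(page, list, pcp_list) {
		if (luf_takeoff_check_and_fold(page)) {
			list_del(&page->pcp_list);
			pcp->count -= 1 << order;
			break;
		}
		if (!--try_luf_pages)
			return NULL;
	}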
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/page_alloc.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7b023b34d53da..f35ae2550019f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3384,6 +3384,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
{
struct page *page;
+ /*
+ * Give up taking a page from the pcp list if the luf check
+ * fails 3 times due to pending tlb shootdown.
+ */
+ int try_luf_pages = 3;
+
do {
if (list_empty(list)) {
int batch = nr_pcp_alloc(pcp, zone, order);
@@ -3398,11 +3404,21 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
return NULL;
}
- page = list_first_entry(list, struct page, pcp_list);
- if (!luf_takeoff_check_and_fold(page))
+ list_for_each_entry(page, list, pcp_list) {
+ if (luf_takeoff_check_and_fold(page)) {
+ list_del(&page->pcp_list);
+ pcp->count -= 1 << order;
+ break;
+ }
+ if (!--try_luf_pages)
+ return NULL;
+ }
+
+ /*
+ * If all the pages in the list fail the check...
+ */
+ if (list_entry_is_head(page, list, pcp_list))
return NULL;
- list_del(&page->pcp_list);
- pcp->count -= 1 << order;
} while (check_new_pages(page, order));
return page;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 19/25] mm: skip luf tlb flush for luf'd mm that already has been done
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (16 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 18/25] mm/page_alloc: retry 3 times to take pcp pages on luf check failure Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 20/25] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
` (5 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
The fault handler performs the tlb flush pended by luf whenever a new
pte gains write permission, no matter whether the required tlb flush
has already been performed or not.
By storing the luf generation number, luf_ugen, in struct mm_struct,
we can skip such unnecessary tlb flushes.
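The fast path then looks roughly like the following, condensed from
the luf_flush_mm() added below, with declarations omitted:

	lb = &mm->luf_batch;
	read_lock_irqsave(&lb->lock, flags);
	fold_batch(tlb_ubc, &lb->batch, false);
	lb_ugen = lb->ugen;
	read_unlock_irqrestore(&lb->lock, flags);

	/* every cpu in the batch already reached lb_ugen: skip the IPIs */
	if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
		return;

	try_to_unmap_flush();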
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/asm-generic/tlb.h | 2 +-
include/linux/mm_types.h | 9 +++++
kernel/fork.c | 1 +
kernel/sched/core.c | 2 +-
mm/memory.c | 22 ++++++++++--
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 74 +++++++++++++++++++++++++++++++++++++--
7 files changed, 104 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4b7d29d8ea794..5be3487bd9192 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -570,7 +570,7 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
/*
* Don't leave stale tlb entries for this vma.
*/
- luf_flush(0);
+ luf_flush_vma(vma);
if (tlb->fullmm || IS_ENABLED(CONFIG_MMU_GATHER_MERGE_VMAS))
return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 117f8e822e969..c32ef19a25056 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -39,8 +39,10 @@ struct luf_batch {
unsigned long ugen;
rwlock_t lock;
};
+void luf_batch_init(struct luf_batch *lb);
#else
struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
#endif
/*
@@ -1073,6 +1075,9 @@ struct mm_struct {
* moving a PROT_NONE mapped page.
*/
atomic_t tlb_flush_pending;
+
+ /* luf batch for this mm */
+ struct luf_batch luf_batch;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
atomic_t tlb_flush_batched;
@@ -1355,8 +1360,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
void luf_flush(unsigned short luf_key);
+void luf_flush_mm(struct mm_struct *mm);
+void luf_flush_vma(struct vm_area_struct *vma);
#else
static inline void luf_flush(unsigned short luf_key) {}
+static inline void luf_flush_mm(struct mm_struct *mm) {}
+static inline void luf_flush_vma(struct vm_area_struct *vma) {}
#endif
struct vm_fault;
diff --git a/kernel/fork.c b/kernel/fork.c
index 364b2d4fd3efa..15274eabc727d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1265,6 +1265,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
spin_lock_init(&mm->arg_lock);
+ luf_batch_init(&mm->luf_batch);
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f4c5da800365..ec132abbbce6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5275,7 +5275,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop_lazy_tlb_sched(mm);
- luf_flush(0);
+ luf_flush_mm(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/memory.c b/mm/memory.c
index e496d8deb887f..93e5879583b07 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6216,6 +6216,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
vm_fault_t ret;
bool is_droppable;
+ struct address_space *mapping = NULL;
bool flush = false;
__set_current_state(TASK_RUNNING);
@@ -6247,9 +6248,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* should be considered.
*/
if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
- flags & FAULT_FLAG_WRITE)
+ flags & FAULT_FLAG_WRITE) {
flush = true;
+ /*
+ * Don't care about the !VM_SHARED cases because they won't
+ * update the pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+ }
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
@@ -6284,8 +6293,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
/*
* Ensure to clean stale tlb entries for this vma.
*/
- if (flush)
- luf_flush(0);
+ if (flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+ }
return ret;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d6678d6bac746..545d401db82c1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -100,7 +100,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
if (pte_accessible(mm, pte))
flush_tlb_page(vma, address);
else
- luf_flush(0);
+ luf_flush_vma(vma);
return pte;
}
#endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 579c75f46c170..fe9c4606ae542 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -695,7 +695,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
-static void luf_batch_init(struct luf_batch *lb)
+void luf_batch_init(struct luf_batch *lb)
{
rwlock_init(&lb->lock);
reset_batch(&lb->batch);
@@ -778,6 +778,31 @@ void fold_luf_batch(struct luf_batch *dst, struct luf_batch *src)
read_unlock_irqrestore(&src->lock, flags);
}
+static void fold_luf_batch_mm(struct luf_batch *dst,
+ struct mm_struct *mm)
+{
+ unsigned long flags;
+ bool need_fold = false;
+
+ read_lock_irqsave(&dst->lock, flags);
+ if (arch_tlbbatch_need_fold(&dst->batch.arch, mm))
+ need_fold = true;
+ read_unlock(&dst->lock);
+
+ write_lock(&dst->lock);
+ if (unlikely(need_fold))
+ arch_tlbbatch_add_pending(&dst->batch.arch, mm, 0, -1UL);
+
+ /*
+ * dst->ugen represents a sort of request for tlb shootdown. The
+ * newer it is, the more tlb shootdown might be needed to
+ * fulfill the newer request. Keep the newest one so as not to
+ * miss any necessary tlb shootdown.
+ */
+ dst->ugen = new_luf_ugen();
+ write_unlock_irqrestore(&dst->lock, flags);
+}
+
static unsigned long tlb_flush_start(void)
{
/*
@@ -894,6 +919,49 @@ void luf_flush(unsigned short luf_key)
}
EXPORT_SYMBOL(luf_flush);
+void luf_flush_vma(struct vm_area_struct *vma)
+{
+ struct mm_struct *mm;
+ struct address_space *mapping = NULL;
+
+ if (!vma)
+ return;
+
+ mm = vma->vm_mm;
+ /*
+ * Don't care about the !VM_SHARED cases because they won't
+ * update the pages that might be shared with others.
+ */
+ if (vma->vm_flags & VM_SHARED && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+
+ if (mapping)
+ luf_flush(0);
+ luf_flush_mm(mm);
+}
+
+void luf_flush_mm(struct mm_struct *mm)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mm)
+ return;
+
+ lb = &mm->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -962,8 +1030,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
if (!can_luf_test())
tlb_ubc = &current->tlb_ubc;
- else
+ else {
tlb_ubc = &current->tlb_ubc_ro;
+ fold_luf_batch_mm(&mm->luf_batch, mm);
+ }
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
tlb_ubc->flush_required = true;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 20/25] mm, fs: skip tlb flushes for luf'd filemap that already has been done
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (17 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 19/25] mm: skip luf tlb flush for luf'd mm that already has been done Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 21/25] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
` (4 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
For a luf'd filemap, tlb shootdown is performed when updating the page
cache, no matter whether the required tlb flushes have already been
done or not.
By storing luf metadata in struct address_space and keeping it up to
date, we can skip such unnecessary tlb flushes.
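For example, a page cache writer now only needs to cover what has been
recorded for its own mapping, and may skip the flush entirely.
Condensed from the mapping_write_begin() and luf_flush_mapping()
changes below:

	ret = mapping->a_ops->write_begin(file, mapping, pos, len, foliop, fsdata);
	if (!ret)
		/* may be a no-op if mapping->luf_batch has already been covered */
		luf_flush_mapping(mapping);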
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
fs/inode.c | 1 +
include/linux/fs.h | 4 ++-
include/linux/mm_types.h | 2 ++
mm/memory.c | 4 +--
mm/rmap.c | 59 +++++++++++++++++++++++++---------------
mm/truncate.c | 14 +++++-----
mm/vmscan.c | 2 +-
7 files changed, 53 insertions(+), 33 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 5587aabdaa5ee..752fb2df6f3b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -475,6 +475,7 @@ static void __address_space_init_once(struct address_space *mapping)
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->i_private_list);
spin_lock_init(&mapping->i_private_lock);
+ luf_batch_init(&mapping->luf_batch);
mapping->i_mmap = RB_ROOT_CACHED;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 78aaf769d32d1..a2f014b31028f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -498,6 +498,7 @@ extern const struct address_space_operations empty_aops;
* @i_private_lock: For use by the owner of the address_space.
* @i_private_list: For use by the owner of the address_space.
* @i_private_data: For use by the owner of the address_space.
+ * @luf_batch: Data to track need of tlb flush by luf.
*/
struct address_space {
struct inode *host;
@@ -519,6 +520,7 @@ struct address_space {
struct list_head i_private_list;
struct rw_semaphore i_mmap_rwsem;
void * i_private_data;
+ struct luf_batch luf_batch;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
@@ -545,7 +547,7 @@ static inline int mapping_write_begin(struct file *file,
* Ensure to clean stale tlb entries for this mapping.
*/
if (!ret)
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c32ef19a25056..d73a3eb0f7b21 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1362,10 +1362,12 @@ extern void tlb_finish_mmu(struct mmu_gather *tlb);
void luf_flush(unsigned short luf_key);
void luf_flush_mm(struct mm_struct *mm);
void luf_flush_vma(struct vm_area_struct *vma);
+void luf_flush_mapping(struct address_space *mapping);
#else
static inline void luf_flush(unsigned short luf_key) {}
static inline void luf_flush_mm(struct mm_struct *mm) {}
static inline void luf_flush_vma(struct vm_area_struct *vma) {}
+static inline void luf_flush_mapping(struct address_space *mapping) {}
#endif
struct vm_fault;
diff --git a/mm/memory.c b/mm/memory.c
index 93e5879583b07..62137ab258d2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6296,10 +6296,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flush) {
/*
* If it has a VM_SHARED mapping, all the mms involved
- * should be luf_flush'ed.
+ * in the struct address_space should be luf_flush'ed.
*/
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index fe9c4606ae542..f5c5190be24e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -691,7 +691,7 @@ void fold_batch(struct tlbflush_unmap_batch *dst,
#define NR_LUF_BATCH (1 << (sizeof(short) * 8))
/*
- * Use 0th entry as accumulated batch.
+ * XXX: Reserve the 0th entry for later use.
*/
struct luf_batch luf_batch[NR_LUF_BATCH];
@@ -936,7 +936,7 @@ void luf_flush_vma(struct vm_area_struct *vma)
mapping = vma->vm_file->f_mapping;
if (mapping)
- luf_flush(0);
+ luf_flush_mapping(mapping);
luf_flush_mm(mm);
}
@@ -962,6 +962,29 @@ void luf_flush_mm(struct mm_struct *mm)
try_to_unmap_flush();
}
+void luf_flush_mapping(struct address_space *mapping)
+{
+ struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+ struct luf_batch *lb;
+ unsigned long flags;
+ unsigned long lb_ugen;
+
+ if (!mapping)
+ return;
+
+ lb = &mapping->luf_batch;
+ read_lock_irqsave(&lb->lock, flags);
+ fold_batch(tlb_ubc, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock_irqrestore(&lb->lock, flags);
+
+ if (arch_tlbbatch_diet(&tlb_ubc->arch, lb_ugen))
+ return;
+
+ try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush_mapping);
+
/*
* Flush TLB entries for recently unmapped pages from remote CPUs. It is
* important if a PTE was dirty when it was unmapped that it's flushed
@@ -1010,7 +1033,8 @@ void try_to_unmap_flush_dirty(void)
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long start, unsigned long end,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
struct tlbflush_unmap_batch *tlb_ubc;
int batch;
@@ -1032,27 +1056,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
tlb_ubc = &current->tlb_ubc;
else {
tlb_ubc = &current->tlb_ubc_ro;
+
fold_luf_batch_mm(&mm->luf_batch, mm);
+ if (mapping)
+ fold_luf_batch_mm(&mapping->luf_batch, mm);
}
arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
tlb_ubc->flush_required = true;
- if (can_luf_test()) {
- struct luf_batch *lb;
- unsigned long flags;
-
- /*
- * Accumulate to the 0th entry right away so that
- * luf_flush(0) can be uesed to properly perform pending
- * TLB flush once this unmapping is observed.
- */
- lb = &luf_batch[0];
- write_lock_irqsave(&lb->lock, flags);
- __fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
- write_unlock_irqrestore(&lb->lock, flags);
- }
-
/*
* Ensure compiler does not re-order the setting of tlb_flush_batched
* before the PTE is cleared.
@@ -1134,7 +1146,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
#else
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
unsigned long start, unsigned long end,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct address_space *mapping)
{
}
@@ -1511,7 +1524,7 @@ int folio_mkclean(struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return cleaned;
}
@@ -2198,6 +2211,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long nr_pages = 1, end_addr;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2359,7 +2373,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* and traps if the PTE is unmapped.
*/
if (should_defer_flush(mm, flags))
- set_tlb_ubc_flush_pending(mm, pteval, address, end_addr, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, end_addr, vma, mapping);
else
flush_tlb_range(vma, address, end_addr);
if (pte_dirty(pteval))
@@ -2611,6 +2625,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
enum ttu_flags flags = (enum ttu_flags)(long)arg;
unsigned long pfn;
unsigned long hsz = 0;
+ struct address_space *mapping = folio_mapping(folio);
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -2758,7 +2773,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE, vma);
+ set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE, vma, mapping);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 68c9ded2f789b..8c133b93cefe8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -142,7 +142,7 @@ void folio_invalidate(struct folio *folio, size_t offset, size_t length)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
}
EXPORT_SYMBOL_GPL(folio_invalidate);
@@ -183,7 +183,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return 0;
}
@@ -234,7 +234,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(folio->mapping);
if (!folio_test_large(folio))
return true;
@@ -324,7 +324,7 @@ long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
@@ -459,7 +459,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -579,7 +579,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return count;
}
@@ -749,7 +749,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 422b9a03a6753..f145c09629b97 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -853,7 +853,7 @@ long remove_mapping(struct address_space *mapping, struct folio *folio)
/*
* Ensure to clean stale tlb entries for this mapping.
*/
- luf_flush(0);
+ luf_flush_mapping(mapping);
return ret;
}
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 21/25] mm: perform luf tlb shootdown per zone in batched manner
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (18 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 20/25] mm, fs: skip tlb flushes for luf'd filemap " Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
` (3 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Each luf page in buddy carries its pending tlb shootdown information,
and the corresponding tlb shootdown is performed when the page exits
buddy. However, every exit from buddy then causes small but frequent
IPIs. Even though the total number of IPIs is reduced, unnecessary
waits on conflicting CPUs in the IPI handler have been observed via
perf profiling.
Thus, perform luf tlb shootdown per zone in a batched manner when pages
exit from buddy, so as to avoid frequent IPIs.
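A stripped-down model of the per-zone generation handshake (illustrative
only; the toy_* names are not kernel symbols, and the real code
additionally handles waiting for a flush that another CPU has already
started): freed luf pages are stamped with the zone's open generation,
and an allocating context either sees that generation already completed
or performs one zone-wide flush on everyone's behalf.

#include <stdbool.h>

/* toy per-zone generation bookkeeping for batched shootdown */
struct toy_zone {
        unsigned long ugen;             /* generation currently being batched */
        unsigned long ugen_done;        /* latest generation whose flush completed */
};

/* a page freed into buddy is stamped with the zone's open batch */
static unsigned long toy_stamp_freed_page(struct toy_zone *z)
{
        return z->ugen;
}

/* on allocation: true if this caller must perform the zone-wide flush */
static bool toy_need_flush(struct toy_zone *z, unsigned long page_ugen)
{
        if (page_ugen <= z->ugen_done)
                return false;           /* someone already flushed this batch */
        z->ugen++;                      /* open a new batch for later frees */
        return true;                    /* caller flushes, then updates ugen_done */
}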
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 44 ++++-
include/linux/mm_types.h | 19 +-
include/linux/mmzone.h | 9 +
include/linux/sched.h | 2 +
mm/compaction.c | 10 +-
mm/internal.h | 13 +-
mm/mm_init.c | 5 +
mm/page_alloc.c | 363 +++++++++++++++++++++++++++++++--------
mm/page_reporting.c | 9 +-
mm/rmap.c | 6 +-
10 files changed, 383 insertions(+), 97 deletions(-)
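One subtle point in the diff below: struct page only keeps the low 16
bits of the zone generation, so page_zone_ugen() must decide whether the
stamp belongs to the current 64K window or to the previous one. Here is
a simplified standalone version of that reconstruction, ignoring
wraparound of the full counter (toy_full_ugen is not a kernel symbol).
For example, with zone_ugen == 0x10005, a stored 0xfff0 resolves to
0xfff0 (previous window) while a stored 0x0003 resolves to 0x10003.

#include <limits.h>

/* toy reconstruction of a full generation from its low 16 bits */
static unsigned long toy_full_ugen(unsigned long zone_ugen, unsigned short short_ugen)
{
        unsigned long cand = (zone_ugen & ~(unsigned long)USHRT_MAX) | short_ugen;

        /* a page's stamp can never be ahead of the zone's current generation */
        if (cand > zone_ugen)
                cand -= (unsigned long)USHRT_MAX + 1;
        return cand;
}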
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74a37cb132caa..2fa5185880105 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4240,12 +4240,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
-#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
/*
* luf_ugen will start with 2 so that 1 can be regarded as a passed one.
*/
#define LUF_UGEN_INIT 2
+/*
+ * zone_ugen will start with 2 so that 1 can be regarded as done.
+ */
+#define ZONE_UGEN_INIT 2
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
static inline bool ugen_before(unsigned long a, unsigned long b)
{
/*
@@ -4256,7 +4260,11 @@ static inline bool ugen_before(unsigned long a, unsigned long b)
static inline unsigned long next_ugen(unsigned long ugen)
{
- if (ugen + 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen + 1))
return ugen + 1;
/*
* Avoid invalid ugen, zero.
@@ -4266,7 +4274,11 @@ static inline unsigned long next_ugen(unsigned long ugen)
static inline unsigned long prev_ugen(unsigned long ugen)
{
- if (ugen - 1)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if ((unsigned short)(ugen - 1))
return ugen - 1;
/*
* Avoid invalid ugen, zero.
@@ -4274,4 +4286,30 @@ static inline unsigned long prev_ugen(unsigned long ugen)
return ugen - 2;
}
#endif
+
+/*
+ * return the biggest ugen but it should be before the real zone_ugen.
+ */
+static inline unsigned long page_zone_ugen(struct zone *zone, struct page *page)
+{
+ unsigned long zone_ugen = zone->zone_ugen;
+ unsigned short short_zone_ugen = page->zone_ugen;
+ unsigned long cand1, cand2;
+
+ if (!short_zone_ugen)
+ return 0;
+
+ cand1 = (zone_ugen & ~(unsigned long)USHRT_MAX) | short_zone_ugen;
+ cand2 = cand1 - USHRT_MAX - 1;
+
+ if (!ugen_before(zone_ugen, cand1))
+ return cand1;
+
+ return cand2;
+}
+
+static inline void set_page_zone_ugen(struct page *page, unsigned short zone_ugen)
+{
+ page->zone_ugen = zone_ugen;
+}
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d73a3eb0f7b21..a1d80ffafe338 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -133,11 +133,20 @@ struct page {
*/
unsigned short order;
- /*
- * For tracking need of tlb flush,
- * by luf(lazy unmap flush).
- */
- unsigned short luf_key;
+ union {
+ /*
+ * For tracking need of
+ * tlb flush, by
+ * luf(lazy unmap flush).
+ */
+ unsigned short luf_key;
+
+ /*
+ * Casted zone_ugen with
+ * unsigned short.
+ */
+ unsigned short zone_ugen;
+ };
};
};
};
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9294cbbe698fc..3f2a79631fedf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -117,6 +117,7 @@ extern int page_group_by_mobility_disabled;
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
struct list_head pend_list[MIGRATE_TYPES];
+ unsigned long pend_zone_ugen[MIGRATE_TYPES];
unsigned long nr_free;
};
@@ -1017,6 +1018,14 @@ struct zone {
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
/* Count pages that need tlb shootdown on allocation */
atomic_long_t nr_luf_pages;
+ /* Generation number for that tlb shootdown has been done */
+ unsigned long zone_ugen_done;
+ /* Generation number to control zone batched tlb shootdown */
+ unsigned long zone_ugen;
+ /* Approximate latest luf_ugen that have ever entered */
+ unsigned long luf_ugen;
+ /* Accumulated tlb batch for this zone */
+ struct tlbflush_unmap_batch zone_batch;
} ____cacheline_internodealigned_in_smp;
enum pgdat_flags {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 31efc88ce911a..96375274d0335 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1404,6 +1404,8 @@ struct task_struct {
int luf_no_shootdown;
int luf_takeoff_started;
unsigned long luf_ugen;
+ unsigned long zone_ugen;
+ unsigned long wait_zone_ugen;
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/compaction.c b/mm/compaction.c
index 5dfa53252d75b..c87a1803b10e2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
goto isolate_fail;
}
- if (!luf_takeoff_check(page))
+ if (!luf_takeoff_check(cc->zone, page))
goto isolate_fail;
/* Found a free page, will break it into order-0 pages */
@@ -691,7 +691,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/*
* Be careful to not go outside of the pageblock.
@@ -1613,7 +1613,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
order_scanned++;
nr_scanned++;
- if (unlikely(consider_pend && !luf_takeoff_check(freepage)))
+ if (unlikely(consider_pend && !luf_takeoff_check(cc->zone, freepage)))
goto scan_next;
pfn = page_to_pfn(freepage);
@@ -1681,7 +1681,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
/* Skip fast search if enough freepages isolated */
if (cc->nr_freepages >= cc->nr_migratepages)
@@ -2419,7 +2419,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
*/
luf_takeoff_start();
ret = __compact_finished(cc);
- luf_takeoff_end();
+ luf_takeoff_end(cc->zone);
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
diff --git a/mm/internal.h b/mm/internal.h
index 9fccfd38e03f0..53056ad7dade9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1610,10 +1610,10 @@ static inline void accept_page(struct page *page)
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
bool luf_takeoff_start(void);
-void luf_takeoff_end(void);
+void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
-bool luf_takeoff_check(struct page *page);
-bool luf_takeoff_check_and_fold(struct page *page);
+bool luf_takeoff_check(struct zone *zone, struct page *page);
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page);
static inline bool non_luf_pages_ok(struct zone *zone)
{
@@ -1623,7 +1623,6 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
-
unsigned short fold_unmap_luf(void);
/*
@@ -1711,10 +1710,10 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
static inline bool luf_takeoff_start(void) { return false; }
-static inline void luf_takeoff_end(void) {}
+static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
-static inline bool luf_takeoff_check(struct page *page) { return true; }
-static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
+static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
+static inline bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page) { return true; }
static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
static inline unsigned short fold_unmap_luf(void) { return 0; }
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 81c5060496112..f067d82f797be 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1422,6 +1422,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
INIT_LIST_HEAD(&zone->free_area[order].pend_list[t]);
+ zone->free_area[order].pend_zone_ugen[t] = ZONE_UGEN_INIT;
zone->free_area[order].nr_free = 0;
}
@@ -1429,6 +1430,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
INIT_LIST_HEAD(&zone->unaccepted_pages);
#endif
atomic_long_set(&zone->nr_luf_pages, 0);
+ zone->zone_ugen_done = ZONE_UGEN_INIT - 1;
+ zone->zone_ugen = ZONE_UGEN_INIT;
+ zone->luf_ugen = LUF_UGEN_INIT - 1;
+ reset_batch(&zone->zone_batch);
}
void __meminit init_currently_empty_zone(struct zone *zone,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f35ae2550019f..0f986cfa4fe39 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -663,16 +663,29 @@ bool luf_takeoff_start(void)
return !no_shootdown;
}
+static void wait_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ while (ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ cond_resched();
+}
+
+static void set_zone_ugen_done(struct zone *zone, unsigned long zone_ugen)
+{
+ WRITE_ONCE(zone->zone_ugen_done, zone_ugen);
+}
+
/*
* Should be called within the same context of luf_takeoff_start().
*/
-void luf_takeoff_end(void)
+void luf_takeoff_end(struct zone *zone)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
unsigned long flags;
bool no_shootdown;
bool outmost = false;
unsigned long cur_luf_ugen;
+ unsigned long cur_zone_ugen;
+ unsigned long cur_wait_zone_ugen;
local_irq_save(flags);
VM_WARN_ON(!current->luf_takeoff_started);
@@ -700,6 +713,8 @@ void luf_takeoff_end(void)
goto out;
cur_luf_ugen = current->luf_ugen;
+ cur_zone_ugen = current->zone_ugen;
+ cur_wait_zone_ugen = current->wait_zone_ugen;
current->luf_ugen = 0;
@@ -707,10 +722,38 @@ void luf_takeoff_end(void)
reset_batch(tlb_ubc_takeoff);
try_to_unmap_flush_takeoff();
+
+ if (cur_wait_zone_ugen || cur_zone_ugen) {
+ /*
+ * pcp(zone == NULL) doesn't work with zone batch.
+ */
+ if (zone) {
+ current->zone_ugen = 0;
+ current->wait_zone_ugen = 0;
+
+ /*
+ * Guarantee that tlb shootdown required for the
+ * zone_ugen has been completed once observing
+ * 'zone_ugen_done'.
+ */
+ smp_mb();
+
+ /*
+ * zone->zone_ugen_done should be updated
+ * sequentially.
+ */
+ if (cur_wait_zone_ugen)
+ wait_zone_ugen_done(zone, cur_wait_zone_ugen);
+ if (cur_zone_ugen)
+ set_zone_ugen_done(zone, cur_zone_ugen);
+ }
+ }
out:
if (outmost) {
VM_WARN_ON(current->luf_no_shootdown);
VM_WARN_ON(current->luf_ugen);
+ VM_WARN_ON(current->zone_ugen);
+ VM_WARN_ON(current->wait_zone_ugen);
}
}
@@ -741,9 +784,9 @@ bool luf_takeoff_no_shootdown(void)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check(struct page *page)
+bool luf_takeoff_check(struct zone *zone, struct page *page)
{
- unsigned short luf_key = page_luf_key(page);
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -753,7 +796,29 @@ bool luf_takeoff_check(struct page *page)
return false;
}
- if (!luf_key)
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+
+ if (!luf_key)
+ return true;
+
+ if (current->luf_no_shootdown)
+ return false;
+
+ return true;
+ }
+
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone->zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
return true;
return !current->luf_no_shootdown;
@@ -763,13 +828,11 @@ bool luf_takeoff_check(struct page *page)
* Should be called with either zone lock held and irq disabled or pcp
* lock held.
*/
-bool luf_takeoff_check_and_fold(struct page *page)
+bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
{
struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
- unsigned short luf_key = page_luf_key(page);
- struct luf_batch *lb;
- unsigned long lb_ugen;
unsigned long flags;
+ unsigned long zone_ugen;
/*
* No way. Delimit using luf_takeoff_{start,end}().
@@ -779,28 +842,94 @@ bool luf_takeoff_check_and_fold(struct page *page)
return false;
}
- if (!luf_key)
- return true;
+ /*
+ * pcp case
+ */
+ if (!zone) {
+ unsigned short luf_key = page_luf_key(page);
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
- lb = &luf_batch[luf_key];
- read_lock_irqsave(&lb->lock, flags);
- lb_ugen = lb->ugen;
+ if (!luf_key)
+ return true;
+
+ lb = &luf_batch[luf_key];
+ read_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+
+ if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return true;
+ }
+
+ if (current->luf_no_shootdown) {
+ read_unlock_irqrestore(&lb->lock, flags);
+ return false;
+ }
- if (arch_tlbbatch_check_done(&lb->batch.arch, lb_ugen)) {
+ fold_batch(tlb_ubc_takeoff, &lb->batch, false);
read_unlock_irqrestore(&lb->lock, flags);
+
+ if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
+ current->luf_ugen = lb_ugen;
return true;
}
- if (current->luf_no_shootdown) {
- read_unlock_irqrestore(&lb->lock, flags);
+ zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ return true;
+
+ /*
+ * Should not be zero since zone->zone_ugen has been updated in
+ * __free_one_page() -> update_zone_batch().
+ */
+ VM_WARN_ON(!zone->zone_ugen);
+
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ return true;
+
+ if (current->luf_no_shootdown)
return false;
- }
- fold_batch(tlb_ubc_takeoff, &lb->batch, false);
- read_unlock_irqrestore(&lb->lock, flags);
+ /*
+ * zone batched flush has been already set.
+ */
+ if (current->zone_ugen)
+ return true;
+
+ /*
+ * Others are already performing tlb shootdown for us. All we
+ * need is to wait for those to complete.
+ */
+ if (zone_ugen != zone->zone_ugen) {
+ if (!current->wait_zone_ugen ||
+ ugen_before(current->wait_zone_ugen, zone_ugen))
+ current->wait_zone_ugen = zone_ugen;
+ /*
+ * It's the first time that zone->zone_ugen has been set to
+ * current->zone_ugen. current->luf_ugen also get set.
+ */
+ } else {
+ current->wait_zone_ugen = prev_ugen(zone->zone_ugen);
+ current->zone_ugen = zone->zone_ugen;
+ current->luf_ugen = zone->luf_ugen;
+
+ /*
+ * Now that tlb shootdown for the zone_ugen will be
+ * performed at luf_takeoff_end(), advance it so that
+ * the next zone->lock holder can efficiently avoid
+ * unnecessary tlb shootdown.
+ */
+ zone->zone_ugen = next_ugen(zone->zone_ugen);
- if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
- current->luf_ugen = lb_ugen;
+ /*
+ * All the luf pages will eventually become non-luf
+ * pages by tlb flushing at luf_takeoff_end() and,
+ * flush_pend_list_if_done() will empty pend_list.
+ */
+ atomic_long_set(&zone->nr_luf_pages, 0);
+ fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
+ }
return true;
}
#endif
@@ -822,6 +951,42 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
zone->nr_free_highatomic + nr_pages);
}
+static void flush_pend_list_if_done(struct zone *zone,
+ struct free_area *area, int migratetype)
+{
+ unsigned long zone_ugen_done = READ_ONCE(zone->zone_ugen_done);
+
+ /*
+ * tlb shootdown required for the zone_ugen already has been
+ * done. Thus, let's move pages in pend_list to free_list to
+ * secure more non-luf pages.
+ */
+ if (!ugen_before(zone_ugen_done, area->pend_zone_ugen[migratetype]))
+ list_splice_init(&area->pend_list[migratetype],
+ &area->free_list[migratetype]);
+}
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+/*
+ * Should be called with zone->lock held and irq disabled.
+ */
+static void update_zone_batch(struct zone *zone, unsigned short luf_key)
+{
+ unsigned long lb_ugen;
+ struct luf_batch *lb = &luf_batch[luf_key];
+
+ read_lock(&lb->lock);
+ fold_batch(&zone->zone_batch, &lb->batch, false);
+ lb_ugen = lb->ugen;
+ read_unlock(&lb->lock);
+
+ if (ugen_before(zone->luf_ugen, lb_ugen))
+ zone->luf_ugen = lb_ugen;
+}
+#else
+static void update_zone_batch(struct zone *zone, unsigned short luf_key) {}
+#endif
+
/* Used for pages not on another list */
static inline void __add_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype,
@@ -830,6 +995,12 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
struct free_area *area = &zone->free_area[order];
struct list_head *list;
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
"page type is %lu, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, 1 << order);
@@ -839,8 +1010,9 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page)) {
+ if (page_zone_ugen(zone, page)) {
list = &area->pend_list[migratetype];
+ area->pend_zone_ugen[migratetype] = zone->zone_ugen;
atomic_long_add(1 << order, &zone->nr_luf_pages);
} else
list = &area->free_list[migratetype];
@@ -862,6 +1034,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
struct free_area *area = &zone->free_area[order];
+ unsigned long zone_ugen = page_zone_ugen(zone, page);
/* Free page moving can fail, so it happens before the type update */
VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
@@ -878,9 +1051,12 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
* positive is okay because it will cause just additional tlb
* shootdown.
*/
- if (page_luf_key(page))
+ if (zone_ugen) {
list_move_tail(&page->buddy_list, &area->pend_list[new_mt]);
- else
+ if (!area->pend_zone_ugen[new_mt] ||
+ ugen_before(area->pend_zone_ugen[new_mt], zone_ugen))
+ area->pend_zone_ugen[new_mt] = zone_ugen;
+ } else
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
account_freepages(zone, -(1 << order), old_mt);
@@ -898,7 +1074,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
if (page_reported(page))
__ClearPageReported(page);
- if (page_luf_key(page))
+ if (page_zone_ugen(zone, page))
atomic_long_sub(1 << order, &zone->nr_luf_pages);
list_del(&page->buddy_list);
@@ -936,29 +1112,39 @@ static inline struct page *get_page_from_free_area(struct zone *zone,
*/
pend_first = !non_luf_pages_ok(zone);
+ /*
+ * Good chance to flush pend_list just before updating the
+ * {free,pend}_list.
+ */
+ flush_pend_list_if_done(zone, area, migratetype);
+
if (pend_first) {
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
} else {
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, buddy_list);
- if (page)
+ if (page) {
+ set_page_zone_ugen(page, 0);
return page;
+ }
page = list_first_entry_or_null(&area->pend_list[migratetype],
struct page, buddy_list);
- if (page && luf_takeoff_check(page))
+ if (page && luf_takeoff_check(zone, page))
return page;
}
return NULL;
@@ -1023,6 +1209,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_pfn;
struct page *buddy;
bool to_tail;
+ unsigned long zone_ugen;
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1034,20 +1221,25 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
/*
- * Use the page's luf_key unchanged if luf_key == 0. Worth
- * noting that page_luf_key() will be 0 in most cases since it's
- * initialized at free_pages_prepare().
+ * Use the page's zone_ugen unchanged if luf_key == 0. Worth
+ * noting that page_zone_ugen() will be 0 in most cases since
+ * it's initialized at free_pages_prepare().
+ *
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
*/
- if (luf_key)
- set_page_luf_key(page, luf_key);
- else
- luf_key = page_luf_key(page);
+ if (luf_key) {
+ zone_ugen = zone->zone_ugen;
+ set_page_zone_ugen(page, (unsigned short)zone_ugen);
+ update_zone_batch(zone, luf_key);
+ } else
+ zone_ugen = page_zone_ugen(zone, page);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
- unsigned short buddy_luf_key;
+ unsigned long buddy_zone_ugen;
- if (!luf_key && compaction_capture(capc, page, order, migratetype)) {
+ if (!zone_ugen && compaction_capture(capc, page, order, migratetype)) {
account_freepages(zone, -(1 << order), migratetype);
return;
}
@@ -1080,17 +1272,15 @@ static inline void __free_one_page(struct page *page,
else
__del_page_from_free_list(buddy, zone, order, buddy_mt);
+ buddy_zone_ugen = page_zone_ugen(zone, buddy);
+
/*
- * !buddy_luf_key && !luf_key : do nothing
- * buddy_luf_key && !luf_key : luf_key = buddy_luf_key
- * !buddy_luf_key && luf_key : do nothing
- * buddy_luf_key && luf_key : merge two into luf_key
+ * if (!zone_ugen && !buddy_zone_ugen) : nothing to do
+ * if ( zone_ugen && !buddy_zone_ugen) : nothing to do
*/
- buddy_luf_key = page_luf_key(buddy);
- if (buddy_luf_key && !luf_key)
- luf_key = buddy_luf_key;
- else if (buddy_luf_key && luf_key)
- fold_luf_batch(&luf_batch[luf_key], &luf_batch[buddy_luf_key]);
+ if ((!zone_ugen && buddy_zone_ugen) ||
+ ( zone_ugen && buddy_zone_ugen && ugen_before(zone_ugen, buddy_zone_ugen)))
+ zone_ugen = buddy_zone_ugen;
if (unlikely(buddy_mt != migratetype)) {
/*
@@ -1103,7 +1293,7 @@ static inline void __free_one_page(struct page *page,
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
- set_page_luf_key(page, luf_key);
+ set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
}
@@ -1524,6 +1714,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
do {
unsigned long pfn;
int mt;
+ unsigned short luf_key;
page = list_last_entry(list, struct page, pcp_list);
pfn = page_to_pfn(page);
@@ -1534,7 +1725,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
count -= nr_pages;
pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE, 0);
+ /*
+ * page private in pcp stores luf_key while it
+ * stores zone_ugen in buddy. Thus, the private
+ * needs to be cleared and the luf_key needs to
+ * be passed to buddy.
+ */
+ luf_key = page_luf_key(page);
+ set_page_private(page, 0);
+
+ __free_one_page(page, pfn, zone, order, mt, FPI_NONE, luf_key);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
@@ -1579,7 +1779,15 @@ static void free_one_page(struct zone *zone, struct page *page,
* valid luf_key can be passed only if order == 0.
*/
VM_WARN_ON(luf_key && order);
- set_page_luf_key(page, luf_key);
+
+ /*
+ * Update page's zone_ugen and zone's batch only if a valid
+ * luf_key was passed.
+ */
+ if (luf_key) {
+ set_page_zone_ugen(page, (unsigned short)zone->zone_ugen);
+ update_zone_batch(zone, luf_key);
+ }
split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
@@ -1733,7 +1941,7 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
if (set_page_guard(zone, &page[size], high))
continue;
- if (page_luf_key(&page[size]))
+ if (page_zone_ugen(zone, &page[size]))
tail = true;
__add_to_free_list(&page[size], zone, high, migratetype, tail);
@@ -1751,7 +1959,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int nr_pages = 1 << high;
__del_page_from_free_list(page, zone, high, migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
@@ -2280,7 +2488,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int nr_added;
del_page_from_free_list(page, zone, current_order, block_type);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
change_pageblock_range(page, current_order, start_type);
nr_added = expand(zone, page, order, current_order, start_type);
@@ -2519,12 +2727,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
WARN_ON_ONCE(ret == -1);
if (ret > 0) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
}
return false;
@@ -2689,12 +2897,15 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
+
+ /*
+ * Reset all the luf fields. tlb shootdown will be
+ * performed at luf_takeoff_end() below if needed.
+ */
+ set_page_private(page, 0);
}
spin_unlock_irqrestore(&zone->lock, flags);
- /*
- * Check and flush before using the pages taken off.
- */
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return i;
}
@@ -3208,7 +3419,7 @@ int __isolate_free_page(struct page *page, unsigned int order, bool willputback)
}
del_page_from_free_list(page, zone, order, mt);
- if (unlikely(!willputback && !luf_takeoff_check_and_fold(page)))
+ if (unlikely(!willputback && !luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
/*
@@ -3307,7 +3518,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return NULL;
}
}
@@ -3315,7 +3526,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
} while (check_new_pages(page, order));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -3405,7 +3616,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
}
list_for_each_entry(page, list, pcp_list) {
- if (luf_takeoff_check_and_fold(page)) {
+ if (luf_takeoff_check_and_fold(NULL, page)) {
list_del(&page->pcp_list);
pcp->count -= 1 << order;
break;
@@ -3440,7 +3651,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (!pcp) {
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
return NULL;
}
@@ -3457,7 +3668,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
@@ -3496,6 +3707,7 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype);
out:
+
/* Separate test+clear to avoid unnecessary atomics */
if ((alloc_flags & ALLOC_KSWAPD) &&
unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
@@ -5095,7 +5307,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
@@ -5105,7 +5317,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
failed_irq:
pcp_trylock_finish(UP_flags);
- luf_takeoff_end();
+ luf_takeoff_end(NULL);
failed:
page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
@@ -7188,7 +7400,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
- if (unlikely(!luf_takeoff_check_and_fold(page)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page)))
VM_WARN_ON(1);
pfn += (1 << order);
}
@@ -7196,7 +7408,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return end_pfn - start_pfn - already_offline;
}
@@ -7258,7 +7470,7 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
if (set_page_guard(zone, current_buddy, high))
continue;
- if (page_luf_key(current_buddy))
+ if (page_zone_ugen(zone, current_buddy))
tail = true;
add_to_free_list(current_buddy, zone, high, migratetype, tail);
@@ -7290,7 +7502,7 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
- if (unlikely(!luf_takeoff_check_and_fold(page_head)))
+ if (unlikely(!luf_takeoff_check_and_fold(zone, page_head)))
VM_WARN_ON(1);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
@@ -7306,7 +7518,7 @@ bool take_page_off_buddy(struct page *page)
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return ret;
}
@@ -7325,6 +7537,13 @@ bool put_page_back_buddy(struct page *page)
int migratetype = get_pfnblock_migratetype(page, pfn);
ClearPageHWPoisonTakenOff(page);
+
+ /*
+ * Reset all the luf fields. tlb shootdown has already
+ * been performed by take_page_off_buddy().
+ */
+ set_page_private(page, 0);
+
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE, 0);
if (TestClearPageHWPoison(page)) {
ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e152b22fbba8a..b23d3ed34ec07 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -118,7 +118,8 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
/*
* Ensure private is zero before putting into the
- * allocator.
+ * allocator. tlb shootdown has already been performed
+ * at isolation.
*/
set_page_private(page, 0);
@@ -194,7 +195,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (PageReported(page))
continue;
- if (unlikely(consider_pend && !luf_takeoff_check(page))) {
+ if (unlikely(consider_pend && !luf_takeoff_check(zone, page))) {
VM_WARN_ON(1);
continue;
}
@@ -238,7 +239,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
/* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
@@ -283,7 +284,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/*
* Check and flush before using the pages taken off.
*/
- luf_takeoff_end();
+ luf_takeoff_end(zone);
return err;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index f5c5190be24e0..a2dc002a9c33d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -650,7 +650,11 @@ static unsigned long new_luf_ugen(void)
{
unsigned long ugen = atomic_long_inc_return(&luf_ugen);
- if (!ugen)
+ /*
+ * Avoid zero even in unsigned short range so as to treat
+ * '(unsigned short)ugen == 0' as invalid.
+ */
+ if (!(unsigned short)ugen)
ugen = atomic_long_inc_return(&luf_ugen);
return ugen;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok()
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (19 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 21/25] mm: perform luf tlb shootdown per zone in batched manner Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 23/25] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
` (2 subsequent siblings)
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
Do not perform tlb shootdown if the context has preemption disabled and
there are already enough non-luf pages, so as not to hurt preemptibility.
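The decision this patch encodes in no_shootdown_context() can be
summarized as follows (a sketch only; the boolean parameters stand in
for preemptible(), irqs_disabled(), in_task() and non_luf_pages_ok(),
and toy_may_shootdown is not a kernel symbol):

#include <stdbool.h>

/* toy decision: may this context issue a luf tlb shootdown? */
static bool toy_may_shootdown(bool preemptible, bool irqs_on, bool in_task,
                              bool enough_non_luf)
{
        if (enough_non_luf)
                /* no pressure: only a fully preemptible task pays the IPI cost */
                return preemptible && in_task;
        /* under pressure: tolerate !preemptible as long as irqs are on in task context */
        return irqs_on && in_task;
}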
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
mm/compaction.c | 6 +++---
mm/internal.h | 5 +++--
mm/page_alloc.c | 27 +++++++++++++++------------
mm/page_isolation.c | 2 +-
mm/page_reporting.c | 4 ++--
5 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index c87a1803b10e2..9098ddb04bbf5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -606,7 +606,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
page = pfn_to_page(blockpfn);
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
/* Isolate free pages. */
for (; blockpfn < end_pfn; blockpfn += stride, page += stride) {
int isolated;
@@ -1603,7 +1603,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
if (!area->nr_free)
continue;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(cc->zone);
spin_lock_irqsave(&cc->zone->lock, flags);
freelist = &area->free_list[MIGRATE_MOVABLE];
retry:
@@ -2417,7 +2417,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
* luf_takeoff_{start,end}() is required to identify whether
* this compaction context is tlb shootdownable for luf'd pages.
*/
- luf_takeoff_start();
+ luf_takeoff_start(cc->zone);
ret = __compact_finished(cc);
luf_takeoff_end(cc->zone);
diff --git a/mm/internal.h b/mm/internal.h
index 53056ad7dade9..7c4198f5e22c3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1609,7 +1609,7 @@ static inline void accept_page(struct page *page)
#endif /* CONFIG_UNACCEPTED_MEMORY */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
extern struct luf_batch luf_batch[];
-bool luf_takeoff_start(void);
+bool luf_takeoff_start(struct zone *zone);
void luf_takeoff_end(struct zone *zone);
bool luf_takeoff_no_shootdown(void);
bool luf_takeoff_check(struct zone *zone, struct page *page);
@@ -1623,6 +1623,7 @@ static inline bool non_luf_pages_ok(struct zone *zone)
return nr_free - nr_luf_pages > min_wm;
}
+
unsigned short fold_unmap_luf(void);
/*
@@ -1709,7 +1710,7 @@ static inline bool can_luf_vma(struct vm_area_struct *vma)
return true;
}
#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
-static inline bool luf_takeoff_start(void) { return false; }
+static inline bool luf_takeoff_start(struct zone *zone) { return false; }
static inline void luf_takeoff_end(struct zone *zone) {}
static inline bool luf_takeoff_no_shootdown(void) { return true; }
static inline bool luf_takeoff_check(struct zone *zone, struct page *page) { return true; }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f986cfa4fe39..9a58d6f7a9609 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -623,22 +623,25 @@ compaction_capture(struct capture_control *capc, struct page *page,
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
-static bool no_shootdown_context(void)
+static bool no_shootdown_context(struct zone *zone)
{
/*
- * If it performs with irq disabled, that might cause a deadlock.
- * Avoid tlb shootdown in this case.
+ * Tries to avoid tlb shootdown if !preemptible(). However, it
+ * should be allowed under heavy memory pressure.
*/
+ if (zone && non_luf_pages_ok(zone))
+ return !(preemptible() && in_task());
+
return !(!irqs_disabled() && in_task());
}
/*
* Can be called with zone lock released and irq enabled.
*/
-bool luf_takeoff_start(void)
+bool luf_takeoff_start(struct zone *zone)
{
unsigned long flags;
- bool no_shootdown = no_shootdown_context();
+ bool no_shootdown = no_shootdown_context(zone);
local_irq_save(flags);
@@ -2669,7 +2672,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
* luf_takeoff_{start,end}() is required for
* get_page_from_free_area() to use luf_takeoff_check().
*/
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct free_area *area = &(zone->free_area[order]);
@@ -2874,7 +2877,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long flags;
int i;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
@@ -3500,7 +3503,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
do {
page = NULL;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
@@ -3645,7 +3648,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
struct page *page;
unsigned long __maybe_unused UP_flags;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -5268,7 +5271,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
if (unlikely(!zone))
goto failed;
- luf_takeoff_start();
+ luf_takeoff_start(NULL);
/* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
@@ -7371,7 +7374,7 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
while (pfn < end_pfn) {
page = pfn_to_page(pfn);
@@ -7489,7 +7492,7 @@ bool take_page_off_buddy(struct page *page)
unsigned int order;
bool ret = false;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 521ed32bdbf67..70f938c0921ae 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -218,7 +218,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
struct page *buddy;
zone = page_zone(page);
- luf_takeoff_start();
+ luf_takeoff_start(zone);
spin_lock_irqsave(&zone->lock, flags);
if (!is_migrate_isolate_page(page))
goto out;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index b23d3ed34ec07..83b66e7f0d257 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -170,7 +170,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
if (free_area_empty(area, mt))
return err;
- can_shootdown = luf_takeoff_start();
+ can_shootdown = luf_takeoff_start(zone);
spin_lock_irq(&zone->lock);
/*
@@ -250,7 +250,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* update budget to reflect call to report function */
budget--;
- luf_takeoff_start();
+ luf_takeoff_start(zone);
/* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 23/25] mm/migrate: apply luf mechanism to unmapping during migration
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (20 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 22/25] mm/page_alloc: not allow to tlb shootdown if !preemptable() && non_luf_pages_ok() Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 25/25] mm/luf: implement luf debug feature Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed, eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
Apply the mechanism to unmapping during migration.
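The batching itself boils down to keeping two lists of unmapped folios,
one whose flush can be deferred (all mappings were read-only) and one
that must be flushed before reuse. A minimal sketch of that split, with
toy types standing in for the migration structures (not kernel code):

#include <stdbool.h>

/* toy folio list split: deferred-flush (luf) vs immediate-flush */
struct toy_folio {
        struct toy_folio *next;
        bool all_ro;            /* did try_to_migrate() see only read-only mappings? */
};

static void toy_split(struct toy_folio *src, struct toy_folio **luf,
                      struct toy_folio **nonluf)
{
        while (src) {
                struct toy_folio *next = src->next;
                struct toy_folio **dst = src->all_ro ? luf : nonluf;

                src->next = *dst;       /* push onto the chosen list */
                *dst = src;
                src = next;
        }
}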
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/mm.h | 2 ++
include/linux/rmap.h | 2 +-
mm/migrate.c | 66 ++++++++++++++++++++++++++++++++++----------
mm/rmap.c | 15 ++++++----
mm/swap.c | 2 +-
5 files changed, 64 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2fa5185880105..b41d7804a06a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1566,6 +1566,8 @@ static inline void folio_put(struct folio *folio)
__folio_put(folio);
}
+void page_cache_release(struct folio *folio);
+
/**
* folio_put_refs - Reduce the reference count on a folio.
* @folio: The folio.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6abf7960077aa..bfccf2efb9000 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -675,7 +675,7 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
-void try_to_migrate(struct folio *folio, enum ttu_flags flags);
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
void try_to_unmap(struct folio *, enum ttu_flags flags);
struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/migrate.c b/mm/migrate.c
index 365c6daa8d1b1..7d6472cc236ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1164,7 +1164,8 @@ static void migrate_folio_undo_dst(struct folio *dst, bool locked,
/* Cleanup src folio upon migration success */
static void migrate_folio_done(struct folio *src,
- enum migrate_reason reason)
+ enum migrate_reason reason,
+ unsigned short luf_key)
{
/*
* Compaction can migrate also non-LRU pages which are
@@ -1175,16 +1176,31 @@ static void migrate_folio_done(struct folio *src,
mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON +
folio_is_file_lru(src), -folio_nr_pages(src));
- if (reason != MR_MEMORY_FAILURE)
- /* We release the page in page_handle_poison. */
+ /* We release the page in page_handle_poison. */
+ if (reason == MR_MEMORY_FAILURE)
+ luf_flush(luf_key);
+ else if (!luf_key)
folio_put(src);
+ else {
+ /*
+ * Should be the last reference.
+ */
+ if (unlikely(!folio_put_testzero(src)))
+ VM_WARN_ON(1);
+
+ page_cache_release(src);
+ folio_unqueue_deferred_split(src);
+ mem_cgroup_uncharge(src);
+ free_frozen_pages(&src->page, folio_order(src), luf_key);
+ }
}
/* Obtain the lock on page, remove all ptes. */
static int migrate_folio_unmap(new_folio_t get_new_folio,
free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio **dstp, enum migrate_mode mode,
- enum migrate_reason reason, struct list_head *ret)
+ enum migrate_reason reason, struct list_head *ret,
+ bool *can_luf)
{
struct folio *dst;
int rc = -EAGAIN;
@@ -1200,7 +1216,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
folio_clear_unevictable(src);
/* free_pages_prepare() will clear PG_isolated. */
list_del(&src->lru);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, 0);
return MIGRATEPAGE_SUCCESS;
}
@@ -1317,7 +1333,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
/* Establish migration ptes */
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
- try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
+ *can_luf = try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
old_page_state |= PAGE_WAS_MAPPED;
}
@@ -1345,7 +1361,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
struct folio *src, struct folio *dst,
enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
+ struct list_head *ret, unsigned short luf_key)
{
int rc;
int old_page_state = 0;
@@ -1399,7 +1415,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
if (anon_vma)
put_anon_vma(anon_vma);
folio_unlock(src);
- migrate_folio_done(src, reason);
+ migrate_folio_done(src, reason, luf_key);
return rc;
out:
@@ -1694,7 +1710,7 @@ static void migrate_folios_move(struct list_head *src_folios,
struct list_head *ret_folios,
struct migrate_pages_stats *stats,
int *retry, int *thp_retry, int *nr_failed,
- int *nr_retry_pages)
+ int *nr_retry_pages, unsigned short luf_key)
{
struct folio *folio, *folio2, *dst, *dst2;
bool is_thp;
@@ -1711,7 +1727,7 @@ static void migrate_folios_move(struct list_head *src_folios,
rc = migrate_folio_move(put_new_folio, private,
folio, dst, mode,
- reason, ret_folios);
+ reason, ret_folios, luf_key);
/*
* The rules are:
* Success: folio will be freed
@@ -1788,7 +1804,11 @@ static int migrate_pages_batch(struct list_head *from,
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
+ LIST_HEAD(unmap_folios_luf);
+ LIST_HEAD(dst_folios_luf);
bool nosplit = (reason == MR_NUMA_MISPLACED);
+ unsigned short luf_key;
+ bool can_luf;
VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
!list_empty(from) && !list_is_singular(from));
@@ -1863,9 +1883,11 @@ static int migrate_pages_batch(struct list_head *from,
continue;
}
+ can_luf = false;
rc = migrate_folio_unmap(get_new_folio, put_new_folio,
private, folio, &dst, mode, reason,
- ret_folios);
+ ret_folios, &can_luf);
+
/*
* The rules are:
* Success: folio will be freed
@@ -1911,7 +1933,8 @@ static int migrate_pages_batch(struct list_head *from,
/* nr_failed isn't updated for not used */
stats->nr_thp_failed += thp_retry;
rc_saved = rc;
- if (list_empty(&unmap_folios))
+ if (list_empty(&unmap_folios) &&
+ list_empty(&unmap_folios_luf))
goto out;
else
goto move;
@@ -1925,8 +1948,13 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_succeeded += is_thp;
break;
case MIGRATEPAGE_UNMAP:
- list_move_tail(&folio->lru, &unmap_folios);
- list_add_tail(&dst->lru, &dst_folios);
+ if (can_luf) {
+ list_move_tail(&folio->lru, &unmap_folios_luf);
+ list_add_tail(&dst->lru, &dst_folios_luf);
+ } else {
+ list_move_tail(&folio->lru, &unmap_folios);
+ list_add_tail(&dst->lru, &dst_folios);
+ }
break;
default:
/*
@@ -1946,6 +1974,8 @@ static int migrate_pages_batch(struct list_head *from,
stats->nr_thp_failed += thp_retry;
stats->nr_failed_pages += nr_retry_pages;
move:
+ /* Should be before try_to_unmap_flush() */
+ luf_key = fold_unmap_luf();
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();
@@ -1959,7 +1989,11 @@ static int migrate_pages_batch(struct list_head *from,
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
- &nr_failed, &nr_retry_pages);
+ &nr_failed, &nr_retry_pages, 0);
+ migrate_folios_move(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages, luf_key);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1970,6 +2004,8 @@ static int migrate_pages_batch(struct list_head *from,
/* Cleanup remaining folios */
migrate_folios_undo(&unmap_folios, &dst_folios,
put_new_folio, private, ret_folios);
+ migrate_folios_undo(&unmap_folios_luf, &dst_folios_luf,
+ put_new_folio, private, ret_folios);
return rc;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index a2dc002a9c33d..e645bb0dd44b5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2925,8 +2925,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*
* Tries to remove all the page table entries which are mapping this folio and
* replace them with special swap entries. Caller must hold the folio lock.
+ * Return true if all the mappings are read-only, otherwise false.
*/
-void try_to_migrate(struct folio *folio, enum ttu_flags flags)
+bool try_to_migrate(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_migrate_one,
@@ -2944,11 +2945,11 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
*/
if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
TTU_SYNC | TTU_BATCH_FLUSH)))
- return;
+ return false;
if (folio_is_zone_device(folio) &&
(!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
- return;
+ return false;
/*
* During exec, a temporary VMA is setup and later moved.
@@ -2968,10 +2969,12 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
#ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/swap.c b/mm/swap.c
index bdfede631aea9..21374892854eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -84,7 +84,7 @@ static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
* This path almost never happens for VM activity - pages are normally freed
* in batches. But it gets used by networking - and for compound pages.
*/
-static void page_cache_release(struct folio *folio)
+void page_cache_release(struct folio *folio)
{
struct lruvec *lruvec = NULL;
unsigned long flags;
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (21 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 23/25] mm/migrate: apply luf mechanism to unmapping during migration Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 25/25] mm/luf: implement luf debug feature Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
A new mechanism, LUF(Lazy Unmap Flush), defers tlb flush until folios
that have been unmapped and freed eventually get allocated again. It's
safe for folios that had been mapped read-only and were unmapped, since
the contents of the folios don't change while staying in pcp or buddy,
so we can still read the data through the stale tlb entries.
Apply the mechanism to unmapping during folio reclaim.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
include/linux/rmap.h | 5 +++--
mm/rmap.c | 11 +++++++----
mm/vmscan.c | 37 ++++++++++++++++++++++++++++++++-----
3 files changed, 42 insertions(+), 11 deletions(-)
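Before the hunks, a condensed sketch of the reclaim-side pattern this patch
introduces: try_to_unmap() now reports whether every removed mapping was
read-only, eligible folios are collected on a separate batch, and that batch
is freed with a luf_key taken from fold_unmap_luf() just before
try_to_unmap_flush(). This is an illustration only, not the actual hunks:
the helper name is made up, and batch-full draining, memcg uncharging,
statistics and the error paths of the real shrink_folio_list() are omitted.
/* Illustrative sketch; not part of the patch below. */
static void shrink_folio_list_sketch(struct list_head *folio_list)
{
	struct folio_batch free_folios;		/* flush before reuse */
	struct folio_batch free_folios_luf;	/* flush deferred via luf_key */
	struct folio *folio, *next;
	unsigned short luf_key;
	folio_batch_init(&free_folios);
	folio_batch_init(&free_folios_luf);
	list_for_each_entry_safe(folio, next, folio_list, lru) {
		/* true only if every mapping removed was read-only */
		bool can_luf = try_to_unmap(folio, TTU_BATCH_FLUSH);
		if (can_luf)
			folio_batch_add(&free_folios_luf, folio);
		else
			folio_batch_add(&free_folios, folio);
	}
	/* must run before try_to_unmap_flush() so the key covers this batch */
	luf_key = fold_unmap_luf();
	try_to_unmap_flush();
	free_unref_folios(&free_folios, 0);		/* no deferral */
	free_unref_folios(&free_folios_luf, luf_key);	/* flushed on realloc */
}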
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bfccf2efb9000..8002f4b2a2d14 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -676,7 +676,7 @@ int folio_referenced(struct folio *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);
bool try_to_migrate(struct folio *folio, enum ttu_flags flags);
-void try_to_unmap(struct folio *, enum ttu_flags flags);
+bool try_to_unmap(struct folio *, enum ttu_flags flags);
struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
void *owner, struct folio **foliop);
@@ -811,8 +811,9 @@ static inline int folio_referenced(struct folio *folio, int is_locked,
return 0;
}
-static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+static inline bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
+ return false;
}
static inline int folio_mkclean(struct folio *folio)
diff --git a/mm/rmap.c b/mm/rmap.c
index e645bb0dd44b5..124ef59afa25e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2583,10 +2583,11 @@ static int folio_not_mapped(struct folio *folio)
* Tries to remove all the page table entries which are mapping this
* folio. It is the caller's responsibility to check if the folio is
* still mapped if needed (use TTU_SYNC to prevent accounting races).
+ * Return true if all the mappings are read-only, otherwise false.
*
* Context: Caller must hold the folio lock.
*/
-void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+bool try_to_unmap(struct folio *folio, enum ttu_flags flags)
{
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
@@ -2605,10 +2606,12 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
else
rmap_walk(folio, &rwc);
- if (can_luf_test())
+ if (can_luf_test()) {
fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
- else
- fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return true;
+ }
+ fold_batch(tlb_ubc, tlb_ubc_ro, true);
+ return false;
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f145c09629b97..a24d2d05df43a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,14 +1102,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct reclaim_stat *stat, bool ignore_references)
{
struct folio_batch free_folios;
+ struct folio_batch free_folios_luf;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
unsigned int nr_reclaimed = 0, nr_demoted = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
struct swap_iocb *plug = NULL;
+ unsigned short luf_key;
folio_batch_init(&free_folios);
+ folio_batch_init(&free_folios_luf);
memset(stat, 0, sizeof(*stat));
cond_resched();
do_demote_pass = can_demote(pgdat->node_id, sc);
@@ -1121,6 +1124,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool can_luf = false;
cond_resched();
@@ -1354,7 +1358,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_test_large(folio))
flags |= TTU_SYNC;
- try_to_unmap(folio, flags);
+ can_luf = try_to_unmap(folio, flags);
if (folio_mapped(folio)) {
stat->nr_unmap_fail += nr_pages;
if (!was_swapbacked &&
@@ -1498,6 +1502,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* leave it off the LRU).
*/
nr_reclaimed += nr_pages;
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
continue;
}
}
@@ -1530,6 +1536,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
nr_reclaimed += nr_pages;
folio_unqueue_deferred_split(folio);
+
+ if (can_luf) {
+ if (folio_batch_add(&free_folios_luf, folio) == 0) {
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+ }
+ continue;
+ }
+
if (folio_batch_add(&free_folios, folio) == 0) {
mem_cgroup_uncharge_folios(&free_folios);
try_to_unmap_flush();
@@ -1564,9 +1583,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
+ if (can_luf)
+ luf_flush(fold_unmap_luf());
}
/* 'folio_list' is always empty here */
+ /*
+ * Finalize this turn before demote_folio_list().
+ */
+ mem_cgroup_uncharge_folios(&free_folios);
+ mem_cgroup_uncharge_folios(&free_folios_luf);
+ luf_key = fold_unmap_luf();
+ try_to_unmap_flush();
+ free_unref_folios(&free_folios, 0);
+ free_unref_folios(&free_folios_luf, luf_key);
+
/* Migrate folios selected for demotion */
nr_demoted = demote_folio_list(&demote_folios, pgdat);
nr_reclaimed += nr_demoted;
@@ -1600,10 +1631,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
- mem_cgroup_uncharge_folios(&free_folios);
- try_to_unmap_flush();
- free_unref_folios(&free_folios, 0);
-
list_splice(&ret_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread* [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 25/25] mm/luf: implement luf debug feature
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 01/25] x86/tlb: add APIs manipulating tlb batch's arch data Byungchul Park
` (22 preceding siblings ...)
2025-02-26 12:01 ` [RFC PATCH v12 based on mm-unstable as of Feb 21, 2025 24/25] mm/vmscan: apply luf mechanism to unmapping during folio reclaim Byungchul Park
@ 2025-02-26 12:01 ` Byungchul Park
23 siblings, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-26 12:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel_team, akpm, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, rjgolo
We need a luf debug feature to detect when luf goes wrong, should that
ever happen. As an RFC, this suggests a simple implementation that
reports problematic situations caused by luf.
Signed-off-by: Byungchul Park <byungchul@sk.com>
---
arch/riscv/include/asm/tlbflush.h | 3 +
arch/riscv/mm/tlbflush.c | 35 ++++-
arch/x86/include/asm/pgtable.h | 10 ++
arch/x86/include/asm/tlbflush.h | 3 +
arch/x86/mm/pgtable.c | 10 ++
arch/x86/mm/tlb.c | 35 ++++-
include/linux/highmem-internal.h | 5 +
include/linux/mm.h | 20 ++-
include/linux/mm_types.h | 16 +--
include/linux/mm_types_task.h | 16 +++
include/linux/sched.h | 5 +
mm/highmem.c | 1 +
mm/memory.c | 12 ++
mm/page_alloc.c | 34 ++++-
mm/page_ext.c | 3 +
mm/rmap.c | 229 ++++++++++++++++++++++++++++++
16 files changed, 418 insertions(+), 19 deletions(-)
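As a rough illustration of the idea before the hunks: every page gets a
luf_batch in its page_ext, which is marked with the pending luf_key at free
time and checked for having been fully flushed whenever the page is about to
become accessible again (kmap, mkwrite, page_to_virt, ...). A simplified
version of that check might look like the following; the helper name is just
for illustration, and the real code additionally takes lb->lock and warns
only once.
/* Simplified sketch; the real check holds lb->lock and reports via VM_WARN. */
static bool lufd_page_flushed_sketch(const struct page *page)
{
	struct page_ext *page_ext = page_ext_get(page);
	struct luf_batch *lb;
	bool flushed;
	if (!page_ext)
		return true;
	lb = page_ext_data(page_ext, &luf_debug_ops);
	/* an empty cpumask means every pending shootdown has completed */
	flushed = arch_tlbbatch_diet(&lb->batch.arch, lb->ugen);
	page_ext_put(page_ext);
	return flushed;
}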
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 936bf9ce0abd9..b927d134cda9b 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -68,6 +68,9 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, unsigned
bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 6ce44370a8e11..345846fbc2ecf 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -215,6 +215,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -222,17 +241,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -242,10 +266,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -253,6 +279,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -261,10 +288,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -282,15 +311,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -308,4 +340,5 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 593f10aabd45a..414bcabb23b51 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -695,12 +695,22 @@ static inline pud_t pud_mkyoung(pud_t pud)
return pud_set_flags(pud, _PAGE_ACCESSED);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud);
+static inline pud_t __pud_mkwrite(pud_t pud)
+{
+ pud = pud_set_flags(pud, _PAGE_RW);
+
+ return pud_clear_saveddirty(pud);
+}
+#else
static inline pud_t pud_mkwrite(pud_t pud)
{
pud = pud_set_flags(pud, _PAGE_RW);
return pud_clear_saveddirty(pud);
}
+#endif
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline int pte_soft_dirty(pte_t pte)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 58ad7e6989bb1..b667987dbd31b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -297,6 +297,9 @@ extern bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch, un
extern bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch, unsigned long ugen);
extern void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen);
+#ifdef CONFIG_LUF_DEBUG
+extern void print_lufd_arch(void);
+#endif
static inline void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1fef5ad32d5a8..d0b7a1437214c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -904,6 +904,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
+ lufd_check_pages(pte_page(pte), 0);
if (vma->vm_flags & VM_SHADOW_STACK)
return pte_mkwrite_shstk(pte);
@@ -914,6 +915,7 @@ pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
+ lufd_check_pages(pmd_page(pmd), PMD_ORDER);
if (vma->vm_flags & VM_SHADOW_STACK)
return pmd_mkwrite_shstk(pmd);
@@ -922,6 +924,14 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd_clear_saveddirty(pmd);
}
+#ifdef CONFIG_LUF_DEBUG
+pud_t pud_mkwrite(pud_t pud)
+{
+ lufd_check_pages(pud_page(pud), PUD_ORDER);
+ return __pud_mkwrite(pud);
+}
+#endif
+
void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
{
/*
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index be6068b60c32d..99b3d54aa74d2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1283,6 +1283,25 @@ static int __init luf_init_arch(void)
}
early_initcall(luf_init_arch);
+#ifdef CONFIG_LUF_DEBUG
+static DEFINE_SPINLOCK(luf_debug_lock);
+#define lufd_lock(f) spin_lock_irqsave(&luf_debug_lock, (f))
+#define lufd_unlock(f) spin_unlock_irqrestore(&luf_debug_lock, (f))
+
+void print_lufd_arch(void)
+{
+ int cpu;
+
+ pr_cont("LUFD ARCH:");
+ for_each_cpu(cpu, cpu_possible_mask)
+ pr_cont(" %lu", atomic_long_read(per_cpu_ptr(&ugen_done, cpu)));
+ pr_cont("\n");
+}
+#else
+#define lufd_lock(f) do { (void)(f); } while(0)
+#define lufd_unlock(f) do { (void)(f); } while(0)
+#endif
+
/*
* batch will not be updated.
*/
@@ -1290,17 +1309,22 @@ bool arch_tlbbatch_check_done(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
done = atomic_long_read(per_cpu_ptr(&ugen_done, cpu));
- if (ugen_before(done, ugen))
+ if (ugen_before(done, ugen)) {
+ lufd_unlock(flags);
return false;
+ }
}
+ lufd_unlock(flags);
return true;
out:
return cpumask_empty(&batch->cpumask);
@@ -1310,10 +1334,12 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
goto out;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
unsigned long done;
@@ -1321,6 +1347,7 @@ bool arch_tlbbatch_diet(struct arch_tlbflush_unmap_batch *batch,
if (!ugen_before(done, ugen))
cpumask_clear_cpu(cpu, &batch->cpumask);
}
+ lufd_unlock(flags);
out:
return cpumask_empty(&batch->cpumask);
}
@@ -1329,10 +1356,12 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, &batch->cpumask) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1350,15 +1379,18 @@ void arch_tlbbatch_mark_ugen(struct arch_tlbflush_unmap_batch *batch,
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
{
int cpu;
+ unsigned long flags;
if (!ugen)
return;
+ lufd_lock(flags);
for_each_cpu(cpu, mm_cpumask(mm)) {
atomic_long_t *done = per_cpu_ptr(&ugen_done, cpu);
unsigned long old = atomic_long_read(done);
@@ -1376,6 +1408,7 @@ void arch_mm_mark_ugen(struct mm_struct *mm, unsigned long ugen)
*/
atomic_long_cmpxchg(done, old, ugen);
}
+ lufd_unlock(flags);
}
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index dd100e849f5e0..0792530d1be7b 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -41,6 +41,7 @@ static inline void *kmap(struct page *page)
{
void *addr;
+ lufd_check_pages(page, 0);
might_sleep();
if (!PageHighMem(page))
addr = page_address(page);
@@ -161,6 +162,7 @@ static inline struct page *kmap_to_page(void *addr)
static inline void *kmap(struct page *page)
{
+ lufd_check_pages(page, 0);
might_sleep();
return page_address(page);
}
@@ -177,11 +179,13 @@ static inline void kunmap(struct page *page)
static inline void *kmap_local_page(struct page *page)
{
+ lufd_check_pages(page, 0);
return page_address(page);
}
static inline void *kmap_local_folio(struct folio *folio, size_t offset)
{
+ lufd_check_folio(folio);
return page_address(&folio->page) + offset;
}
@@ -204,6 +208,7 @@ static inline void __kunmap_local(const void *addr)
static inline void *kmap_atomic(struct page *page)
{
+ lufd_check_pages(page, 0);
if (IS_ENABLED(CONFIG_PREEMPT_RT))
migrate_disable();
else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b41d7804a06a2..5304477e7da8e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -46,6 +46,24 @@ extern int sysctl_page_lock_unfairness;
void mm_core_init(void);
void init_mm_internals(void);
+#ifdef CONFIG_LUF_DEBUG
+void lufd_check_folio(struct folio *f);
+void lufd_check_pages(const struct page *p, unsigned int order);
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order);
+void lufd_check_queued_pages(void);
+void lufd_queue_page_for_check(struct page *page, int order);
+void lufd_mark_folio(struct folio *f, unsigned short luf_key);
+void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key);
+#else
+static inline void lufd_check_folio(struct folio *f) {}
+static inline void lufd_check_pages(const struct page *p, unsigned int order) {}
+static inline void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order) {}
+static inline void lufd_check_queued_pages(void) {}
+static inline void lufd_queue_page_for_check(struct page *page, int order) {}
+static inline void lufd_mark_folio(struct folio *f, unsigned short luf_key) {}
+static inline void lufd_mark_pages(struct page *p, unsigned int order, unsigned short luf_key) {}
+#endif
+
#ifndef CONFIG_NUMA /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -115,7 +133,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
#endif
#ifndef page_to_virt
-#define page_to_virt(x) __va(PFN_PHYS(page_to_pfn(x)))
+#define page_to_virt(x) ({ lufd_check_pages(x, 0); __va(PFN_PHYS(page_to_pfn(x)));})
#endif
#ifndef lm_alias
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a1d80ffafe338..30d29a6f9db4c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,10 @@
#include <asm/mmu.h>
+#ifdef CONFIG_LUF_DEBUG
+extern struct page_ext_operations luf_debug_ops;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -33,18 +37,6 @@
struct address_space;
struct mem_cgroup;
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-struct luf_batch {
- struct tlbflush_unmap_batch batch;
- unsigned long ugen;
- rwlock_t lock;
-};
-void luf_batch_init(struct luf_batch *lb);
-#else
-struct luf_batch {};
-static inline void luf_batch_init(struct luf_batch *lb) {}
-#endif
-
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba46..3b87f8674e528 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -10,6 +10,7 @@
#include <linux/align.h>
#include <linux/types.h>
+#include <linux/spinlock_types.h>
#include <asm/page.h>
@@ -88,4 +89,19 @@ struct tlbflush_unmap_batch {
#endif
};
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+struct luf_batch {
+ struct tlbflush_unmap_batch batch;
+ unsigned long ugen;
+ rwlock_t lock;
+};
+void luf_batch_init(struct luf_batch *lb);
+#else
+struct luf_batch {};
+static inline void luf_batch_init(struct luf_batch *lb) {}
+#endif
+
+#if defined(CONFIG_LUF_DEBUG)
+#define NR_LUFD_PAGES 512
+#endif
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 96375274d0335..9cb8e6fa1b1b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,11 @@ struct task_struct {
unsigned long luf_ugen;
unsigned long zone_ugen;
unsigned long wait_zone_ugen;
+#if defined(CONFIG_LUF_DEBUG)
+ struct page *lufd_pages[NR_LUFD_PAGES];
+ int lufd_pages_order[NR_LUFD_PAGES];
+ int lufd_pages_nr;
+#endif
#endif
struct tlbflush_unmap_batch tlb_ubc;
diff --git a/mm/highmem.c b/mm/highmem.c
index ef3189b36cadb..a323d5a655bf9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -576,6 +576,7 @@ void *__kmap_local_page_prot(struct page *page, pgprot_t prot)
{
void *kmap;
+ lufd_check_pages(page, 0);
/*
* To broaden the usage of the actual kmap_local() machinery always map
* pages when debugging is enabled and the architecture has no problems
diff --git a/mm/memory.c b/mm/memory.c
index 62137ab258d2c..26e8b73436eab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6259,6 +6259,18 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
mapping = vma->vm_file->f_mapping;
}
+#ifdef CONFIG_LUF_DEBUG
+ if (luf_flush) {
+ /*
+ * If it has a VM_SHARED mapping, all the mms involved
+ * in the struct address_space should be luf_flush'ed.
+ */
+ if (mapping)
+ luf_flush_mapping(mapping);
+ luf_flush_mm(mm);
+ }
+#endif
+
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a58d6f7a9609..8a114a4339d68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -758,6 +758,8 @@ void luf_takeoff_end(struct zone *zone)
VM_WARN_ON(current->zone_ugen);
VM_WARN_ON(current->wait_zone_ugen);
}
+
+ lufd_check_queued_pages();
}
/*
@@ -853,8 +855,10 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
struct luf_batch *lb;
unsigned long lb_ugen;
- if (!luf_key)
+ if (!luf_key) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
lb = &luf_batch[luf_key];
read_lock_irqsave(&lb->lock, flags);
@@ -875,12 +879,15 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
if (!current->luf_ugen || ugen_before(current->luf_ugen, lb_ugen))
current->luf_ugen = lb_ugen;
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
zone_ugen = page_zone_ugen(zone, page);
- if (!zone_ugen)
+ if (!zone_ugen) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
/*
* Should not be zero since zone-zone_ugen has been updated in
@@ -888,17 +895,23 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
*/
VM_WARN_ON(!zone->zone_ugen);
- if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen))
+ if (!ugen_before(READ_ONCE(zone->zone_ugen_done), zone_ugen)) {
+ lufd_check_pages(page, buddy_order(page));
return true;
+ }
if (current->luf_no_shootdown)
return false;
+ lufd_check_zone_pages(zone, page, buddy_order(page));
+
/*
* zone batched flush has been already set.
*/
- if (current->zone_ugen)
+ if (current->zone_ugen) {
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
+ }
/*
* Others are already performing tlb shootdown for us. All we
@@ -933,6 +946,7 @@ bool luf_takeoff_check_and_fold(struct zone *zone, struct page *page)
atomic_long_set(&zone->nr_luf_pages, 0);
fold_batch(tlb_ubc_takeoff, &zone->zone_batch, true);
}
+ lufd_queue_page_for_check(page, buddy_order(page));
return true;
}
#endif
@@ -1238,6 +1252,11 @@ static inline void __free_one_page(struct page *page,
} else
zone_ugen = page_zone_ugen(zone, page);
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
unsigned long buddy_zone_ugen;
@@ -1299,6 +1318,10 @@ static inline void __free_one_page(struct page *page,
set_page_zone_ugen(page, zone_ugen);
pfn = combined_pfn;
order++;
+ if (!zone_ugen)
+ lufd_check_pages(page, order);
+ else
+ lufd_check_zone_pages(zone, page, order);
}
done_merging:
@@ -3246,6 +3269,8 @@ void free_frozen_pages(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ lufd_mark_pages(page, order, luf_key);
+
if (!pcp_allowed_order(order)) {
__free_pages_ok(page, order, FPI_NONE, luf_key);
return;
@@ -3298,6 +3323,7 @@ void free_unref_folios(struct folio_batch *folios, unsigned short luf_key)
unsigned long pfn = folio_pfn(folio);
unsigned int order = folio_order(folio);
+ lufd_mark_folio(folio, luf_key);
if (!free_pages_prepare(&folio->page, order))
continue;
/*
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c1..be40bc2a93378 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -89,6 +89,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
+#ifdef CONFIG_LUF_DEBUG
+ &luf_debug_ops,
+#endif
};
unsigned long page_ext_size;
diff --git a/mm/rmap.c b/mm/rmap.c
index 124ef59afa25e..11bdbbc47ad11 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1161,6 +1161,235 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
}
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+#ifdef CONFIG_LUF_DEBUG
+
+static bool need_luf_debug(void)
+{
+ return true;
+}
+
+static void init_luf_debug(void)
+{
+ /* Do nothing */
+}
+
+struct page_ext_operations luf_debug_ops = {
+ .size = sizeof(struct luf_batch),
+ .need = need_luf_debug,
+ .init = init_luf_debug,
+ .need_shared_flags = false,
+};
+
+static bool __lufd_check_zone_pages(struct page *page, int nr,
+ struct tlbflush_unmap_batch *batch, unsigned long ugen)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_done(&lb->batch.arch, &batch->arch);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret || ugen_before(ugen, lb_ugen))
+ return false;
+ }
+ return true;
+}
+
+void lufd_check_zone_pages(struct zone *zone, struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page || !zone)
+ return;
+
+ warn = !__lufd_check_zone_pages(page, 1 << order,
+ &zone->zone_batch, zone->luf_ugen);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+
+static bool __lufd_check_pages(const struct page *page, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+ unsigned long lb_ugen;
+ unsigned long flags;
+ bool ret;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ write_lock_irqsave(&lb->lock, flags);
+ lb_ugen = lb->ugen;
+ ret = arch_tlbbatch_diet(&lb->batch.arch, lb_ugen);
+ write_unlock_irqrestore(&lb->lock, flags);
+ page_ext_put(page_ext);
+
+ if (!ret)
+ return false;
+ }
+ return true;
+}
+
+void lufd_queue_page_for_check(struct page *page, int order)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+
+ if (!page)
+ return;
+
+ if (current->lufd_pages_nr >= NR_LUFD_PAGES) {
+ VM_WARN_ONCE(1, "LUFD: NR_LUFD_PAGES is too small.\n");
+ return;
+ }
+
+ *(parray + current->lufd_pages_nr) = page;
+ *(oarray + current->lufd_pages_nr) = order;
+ current->lufd_pages_nr++;
+}
+
+void lufd_check_queued_pages(void)
+{
+ struct page **parray = current->lufd_pages;
+ int *oarray = current->lufd_pages_order;
+ int i;
+
+ for (i = 0; i < current->lufd_pages_nr; i++)
+ lufd_check_pages(*(parray + i), *(oarray + i));
+ current->lufd_pages_nr = 0;
+}
+
+void lufd_check_folio(struct folio *folio)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!folio)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_folio);
+
+void lufd_check_pages(const struct page *page, unsigned int order)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!page)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+EXPORT_SYMBOL(lufd_check_pages);
+
+static void __lufd_mark_pages(struct page *page, int nr, unsigned short luf_key)
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct page_ext *page_ext;
+ struct luf_batch *lb;
+
+ page_ext = page_ext_get(page + i);
+ if (!page_ext)
+ continue;
+
+ lb = (struct luf_batch *)page_ext_data(page_ext, &luf_debug_ops);
+ fold_luf_batch(lb, &luf_batch[luf_key]);
+ page_ext_put(page_ext);
+ }
+}
+
+void lufd_mark_folio(struct folio *folio, unsigned short luf_key)
+{
+ struct page *page;
+ int nr;
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ page = folio_page(folio, 0);
+ nr = folio_nr_pages(folio);
+
+ warn = !__lufd_check_pages(page, nr);
+ __lufd_mark_pages(page, nr, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) nr(%d)\n",
+ atomic_long_read(&luf_ugen), page, nr);
+ print_lufd_arch();
+ }
+}
+
+void lufd_mark_pages(struct page *page, unsigned int order, unsigned short luf_key)
+{
+ bool warn;
+ static bool once = false;
+
+ if (!luf_key)
+ return;
+
+ warn = !__lufd_check_pages(page, 1 << order);
+ __lufd_mark_pages(page, 1 << order, luf_key);
+
+ if (warn && !READ_ONCE(once)) {
+ WRITE_ONCE(once, true);
+ VM_WARN(1, "LUFD: ugen(%lu) page(%p) order(%u)\n",
+ atomic_long_read(&luf_ugen), page, order);
+ print_lufd_arch();
+ }
+}
+#endif
+
/**
* page_address_in_vma - The virtual address of a page in this VMA.
* @folio: The folio containing the page.
--
2.17.1
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:29 ` Vlastimil Babka
2025-02-20 23:37 ` Byungchul Park
@ 2025-02-22 1:14 ` Shakeel Butt
1 sibling, 0 replies; 102+ messages in thread
From: Shakeel Butt @ 2025-02-22 1:14 UTC (permalink / raw)
To: Vlastimil Babka, sj
Cc: Dave Hansen, Byungchul Park, linux-kernel, linux-mm, kernel_team,
akpm, ying.huang, vernhao, mgorman, hughd, willy, david, peterz,
luto, tglx, mingo, bp, dave.hansen, rjgolo
On Thu, Feb 20, 2025 at 04:29:51PM +0100, Vlastimil Babka wrote:
> On 2/20/25 16:15, Dave Hansen wrote:
> > On 2/19/25 21:20, Byungchul Park wrote:
> >> I'm posting the latest version so that anyone can try luf mechanism if
> >> wanted by any chance. However, I tagged RFC again because there are
> >> still issues that should be resolved to merge to mainline:
> >
> > I don't see anything fundamentally different here from the last 11
> > versions. I think the entire approach is dangerous and basically makes
> > things impossible to debug. It's not clear that some of the failure
> > scenarios that I've brought up in the past have actually been fixed.
>
> Yes, and it's still an invasive change to the buddy allocator.
> IIRC at Plumbers the opinion in the audience was that there might be ways to
> improve the batching on unmap to reduce the flushes without such an invasive
> and potentially dangerous change? Has that been investigated?
>
I know SJ (CCed) is working on making TLB flush batching work for
process_madvise().
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [RFC PATCH v12 00/26] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
2025-02-20 15:15 ` Dave Hansen
2025-02-20 15:29 ` Vlastimil Babka
@ 2025-02-20 23:23 ` Byungchul Park
1 sibling, 0 replies; 102+ messages in thread
From: Byungchul Park @ 2025-02-20 23:23 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, vernhao,
mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp,
dave.hansen, rjgolo
On Thu, Feb 20, 2025 at 07:15:44AM -0800, Dave Hansen wrote:
> On 2/19/25 21:20, Byungchul Park wrote:
> > I'm posting the latest version so that anyone can try luf mechanism if
> > wanted by any chance. However, I tagged RFC again because there are
> > still issues that should be resolved to merge to mainline:
>
> I don't see anything fundamentally different here from the last 11
> versions. I think the entire approach is dangerous and basically makes
> things impossible to debug. It's not clear that some of the failure
> scenarios that I've brought up in the past have actually been fixed.
Respect your opinion.
> What I've said here still stands:
>
> > https://lore.kernel.org/all/fab1dd64-c652-4160-93b4-7b483a8874da@intel.com/
>
> > I think tglx would call all of this "tinkering". The approach to this
> > series is to "fix" narrow, specific cases that reviewers point out, make
> > it compile, then send it out again, hoping someone will apply it.
> >
> > So, for me, until the approach to this series changes: NAK, for x86.
> > Andrew, please don't take this series. Or, if you do, please drop the
> > patch enabling it on x86.
>
> I think I'd also like to stop being cc'd on this. If LUF is merged into
I will un-cc you from the next spin.
Byungchul
> mainline and proven to work on arm64 or riscv for a year, I'd be happy
> to take another look at enabling it on x86. I think that's just about
> the only thing that would make me reconsider.
^ permalink raw reply [flat|nested] 102+ messages in thread