linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
@ 2025-11-27  1:14 Barry Song
  2025-11-27  1:14 ` [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27  1:14 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: Barry Song, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox, Pedro Falcato,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Oven Liyang,
	Mark Rutland, Ada Couprie Diaz, Robin Murphy,
	Kristina Martšenko, Kevin Brodsky, Yeoreum Yun, Wentao Guan,
	Thorsten Blum, Steven Rostedt, Yunhui Cui, Nam Cao, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel

From: Barry Song <v-songbaohua@oppo.com>

Oven observed that most mmap_lock contention and priority inversion
come from page fault retries after waiting for I/O completion.
He subsequently raised the following idea:

There is no need to always fall back to mmap_lock if the per-VMA
lock was released only to wait for pagecache or swapcache to
become ready.

In this case, the retry path can continue using the per-VMA lock.
This is a big win: it greatly reduces mmap_lock acquisitions.
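
For reference, a minimal sketch of the retry pattern each architecture's
fault handler gains, condensed from the arm64 hunk in patch 1 (the
surrounding vm_flags checks and error handling are omitted):

retry_vma:
	vma = lock_vma_under_rcu(mm, addr);
	if (!vma)
		goto lock_mmap;
	fault = handle_mm_fault(vma, addr,
				mm_flags | FAULT_FLAG_VMA_LOCK, regs);
	...
	/* The per-VMA lock was dropped only to wait for I/O to complete,
	 * so retry under the per-VMA lock instead of taking mmap_lock. */
	if (fault & VM_FAULT_RETRY_VMA)
		goto retry_vma;
lock_mmap:
	...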

Oven Liyang (1):
  mm/filemap: Retry fault by VMA lock if the lock was released for I/O

Barry Song (1):
  mm/swapin: Retry swapin by VMA lock if the lock was released for I/O

 arch/arm/mm/fault.c       |  5 +++++
 arch/arm64/mm/fault.c     |  5 +++++
 arch/loongarch/mm/fault.c |  4 ++++
 arch/powerpc/mm/fault.c   |  5 ++++-
 arch/riscv/mm/fault.c     |  4 ++++
 arch/s390/mm/fault.c      |  4 ++++
 arch/x86/mm/fault.c       |  4 ++++
 include/linux/mm_types.h  |  9 +++++----
 mm/filemap.c              |  5 ++++-
 mm/memory.c               | 10 ++++++++--
 10 files changed, 47 insertions(+), 8 deletions(-)

Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Oven Liyang <liyangouwen1@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kristina Martšenko <kristina.martsenko@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Wentao Guan <guanwentao@uniontech.com>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org

-- 
2.39.3 (Apple Git-146)




* [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2025-11-27  1:14 [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Barry Song
@ 2025-11-27  1:14 ` Barry Song
  2025-11-27 10:52   ` Pedro Falcato
  2025-11-27  1:14 ` [RFC PATCH 2/2] mm/swapin: Retry swapin " Barry Song
  2025-11-27  4:09 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
  2 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2025-11-27  1:14 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: Oven Liyang, Russell King, Catalin Marinas, Will Deacon,
	Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Ghiti, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox, Pedro Falcato,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Mark Rutland,
	Ada Couprie Diaz, Robin Murphy, Kristina Martšenko,
	Kevin Brodsky, Yeoreum Yun, Wentao Guan, Thorsten Blum,
	Steven Rostedt, Yunhui Cui, Nam Cao, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song

From: Oven Liyang <liyangouwen1@oppo.com>

If the current page fault is using the per-VMA lock, and we only released
the lock to wait for I/O completion (e.g., using folio_lock()), then when
the fault is retried after the I/O completes, it should still qualify for
the per-VMA-lock path.

Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kristina Martšenko <kristina.martsenko@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Wentao Guan <guanwentao@uniontech.com>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm/mm/fault.c       | 5 +++++
 arch/arm64/mm/fault.c     | 5 +++++
 arch/loongarch/mm/fault.c | 4 ++++
 arch/powerpc/mm/fault.c   | 5 ++++-
 arch/riscv/mm/fault.c     | 4 ++++
 arch/s390/mm/fault.c      | 4 ++++
 arch/x86/mm/fault.c       | 4 ++++
 include/linux/mm_types.h  | 9 +++++----
 mm/filemap.c              | 5 ++++-
 9 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..49fc0340821c 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -313,6 +313,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -342,6 +343,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 125dfa6c613b..842f50b99d3e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -622,6 +622,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -668,6 +669,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 			goto no_context;
 		return 0;
 	}
+
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c
index 2c93d33356e5..738f495560c0 100644
--- a/arch/loongarch/mm/fault.c
+++ b/arch/loongarch/mm/fault.c
@@ -219,6 +219,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -265,6 +266,9 @@ static void __kprobes __do_page_fault(struct pt_regs *regs,
 			no_context(regs, write, address);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 806c74e0d5ab..cb7ffc20c760 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -487,6 +487,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -516,7 +517,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	if (fault_signal_pending(fault, regs))
 		return user_mode(regs) ? 0 : SIGBUS;
-
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 	/* When running in the kernel we expect faults to occur only to
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 04ed6f8acae4..b94cf57c2b9a 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -347,6 +347,7 @@ void handle_page_fault(struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, addr);
 	if (!vma)
 		goto lock_mmap;
@@ -376,6 +377,9 @@ void handle_page_fault(struct pt_regs *regs)
 			no_context(regs, addr);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index e1ad05bfd28a..8d91c6495e13 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -286,6 +286,7 @@ static void do_exception(struct pt_regs *regs, int access)
 		flags |= FAULT_FLAG_WRITE;
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -310,6 +311,9 @@ static void do_exception(struct pt_regs *regs, int access)
 			handle_fault_error_nolock(regs, 0);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 retry:
 	vma = lock_mm_and_find_vma(mm, address, regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 998bd807fc7b..6023d0083903 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,6 +1324,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
+retry_vma:
 	vma = lock_vma_under_rcu(mm, address);
 	if (!vma)
 		goto lock_mmap;
@@ -1353,6 +1354,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 						 ARCH_DEFAULT_PKEY);
 		return;
 	}
+	/* If the first try is only about waiting for the I/O to complete */
+	if (fault & VM_FAULT_RETRY_VMA)
+		goto retry_vma;
 lock_mmap:
 
 retry:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b71625378ce3..12b2d65ef1b9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1670,10 +1670,11 @@ enum vm_fault_reason {
 	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
 	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
 	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
-	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
-	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
-	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
-	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,
+	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x001000,
+	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x002000,
+	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x004000,
+	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7d15a9c216ef..57dfd2211109 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3464,6 +3464,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	bool mapping_locked = false;
+	bool retry_by_vma_lock = false;
 
 	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(index >= max_idx))
@@ -3560,6 +3561,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (fpin) {
 		folio_unlock(folio);
+		if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+			retry_by_vma_lock = true;
 		goto out_retry;
 	}
 	if (mapping_locked)
@@ -3610,7 +3613,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		filemap_invalidate_unlock_shared(mapping);
 	if (fpin)
 		fput(fpin);
-	return ret | VM_FAULT_RETRY;
+	return ret | VM_FAULT_RETRY | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 EXPORT_SYMBOL(filemap_fault);
 
-- 
2.39.3 (Apple Git-146)




* [RFC PATCH 2/2] mm/swapin: Retry swapin by VMA lock if the lock was released for I/O
  2025-11-27  1:14 [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Barry Song
  2025-11-27  1:14 ` [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song
@ 2025-11-27  1:14 ` Barry Song
  2025-11-27  4:09 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
  2 siblings, 0 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27  1:14 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: Barry Song, David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Russell King, Catalin Marinas, Will Deacon, Huacai Chen,
	WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Matthew Wilcox,
	Pedro Falcato, Jarkko Sakkinen, Oscar Salvador,
	Kuninori Morimoto, Oven Liyang, Mark Rutland, Ada Couprie Diaz,
	Robin Murphy, Kristina Martšenko, Kevin Brodsky,
	Yeoreum Yun, Wentao Guan, Thorsten Blum, Steven Rostedt,
	Yunhui Cui, Nam Cao, Chris Li, Kairui Song, Kemeng Shi,
	Nhat Pham, Baoquan He, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

From: Barry Song <v-songbaohua@oppo.com>

If the current do_swap_page() took the per-VMA lock and we dropped it only
to wait for I/O completion (e.g., using folio_wait_locked()), then when
do_swap_page() is retried after the I/O completes, it should still qualify
for the per-VMA-lock path.

Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Oven Liyang <liyangouwen1@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kristina Martšenko <kristina.martsenko@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Wentao Guan <guanwentao@uniontech.com>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4f933fedd33e..7f70f0324dcf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4654,6 +4654,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	unsigned long page_idx;
 	unsigned long address;
 	pte_t *ptep;
+	bool retry_by_vma_lock = false;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4758,8 +4759,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	swapcache = folio;
 	ret |= folio_lock_or_retry(folio, vmf);
-	if (ret & VM_FAULT_RETRY)
+	if (ret & VM_FAULT_RETRY) {
+		if (fault_flag_allow_retry_first(vmf->flags) &&
+		    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT) &&
+		    (vmf->flags & FAULT_FLAG_VMA_LOCK))
+			retry_by_vma_lock = true;
 		goto out_release;
+	}
 
 	page = folio_file_page(folio, swp_offset(entry));
 	/*
@@ -5044,7 +5050,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	if (si)
 		put_swap_device(si);
-	return ret;
+	return ret | (retry_by_vma_lock ? VM_FAULT_RETRY_VMA : 0);
 }
 
 static bool pte_range_none(pte_t *pte, int nr_pages)
-- 
2.39.3 (Apple Git-146)




* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27  1:14 [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Barry Song
  2025-11-27  1:14 ` [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song
  2025-11-27  1:14 ` [RFC PATCH 2/2] mm/swapin: Retry swapin " Barry Song
@ 2025-11-27  4:09 ` Matthew Wilcox
  2025-11-27  4:22   ` Barry Song
  2 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2025-11-27  4:09 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, Barry Song, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Pedro Falcato,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Oven Liyang,
	Mark Rutland, Ada Couprie Diaz, Robin Murphy,
	Kristina Martšenko, Kevin Brodsky, Yeoreum Yun, Wentao Guan,
	Thorsten Blum, Steven Rostedt, Yunhui Cui, Nam Cao, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel

On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> There is no need to always fall back to mmap_lock if the per-VMA
> lock was released only to wait for pagecache or swapcache to
> become ready.

Something I've been wondering about is removing all the "drop the MM
locks while we wait for I/O" gunk.  It's a nice amount of code removed:

 include/linux/pagemap.h |  8 +---
 mm/filemap.c            | 98 ++++++++++++-------------------------------------
 mm/internal.h           | 21 -----------
 mm/memory.c             | 13 +------
 mm/shmem.c              |  6 ---
 5 files changed, 27 insertions(+), 119 deletions(-)

and I'm not sure we still need to do it with per-VMA locks.  What I
have here doesn't boot and I ran out of time to debug it.



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27  4:09 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
@ 2025-11-27  4:22   ` Barry Song
  2025-11-27  4:42     ` Barry Song
  2025-11-27 19:43     ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27  4:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, Barry Song, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Pedro Falcato,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Oven Liyang,
	Mark Rutland, Ada Couprie Diaz, Robin Murphy,
	Kristina Martšenko, Kevin Brodsky, Yeoreum Yun, Wentao Guan,
	Thorsten Blum, Steven Rostedt, Yunhui Cui, Nam Cao, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel

On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > There is no need to always fall back to mmap_lock if the per-VMA
> > lock was released only to wait for pagecache or swapcache to
> > become ready.
>
> Something I've been wondering about is removing all the "drop the MM
> locks while we wait for I/O" gunk.  It's a nice amount of code removed:

I think the point is that page fault handlers should avoid holding the VMA
lock or mmap_lock for too long while waiting for I/O. Otherwise, those
writers and readers will be stuck for a while.

>
>  include/linux/pagemap.h |  8 +---
>  mm/filemap.c            | 98 ++++++++++++-------------------------------------
>  mm/internal.h           | 21 -----------
>  mm/memory.c             | 13 +------
>  mm/shmem.c              |  6 ---
>  5 files changed, 27 insertions(+), 119 deletions(-)
>
> and I'm not sure we still need to do it with per-VMA locks.  What I
> have here doesn't boot and I ran out of time to debug it.

I agree there’s room for improvement, but merely removing the "drop the MM
locks while waiting for I/O" code is unlikely to improve performance.

For example, we could change the flow to (a rough sketch follows the two
lists below):
1. Release the VMA lock or mmap_lock
2. Lock the folio
3. Re-acquire the VMA lock or mmap_lock
4. Re-check whether we can still map the PTE
5. Map the PTE

Currently, the flow is always:

1. Release the VMA lock or mmap_lock
2. Lock the folio
3. Unlock the folio
4. Re-enter the page fault handling from the beginning
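
A rough pseudocode sketch of the proposed flow (steps 1-5 above). This is
hypothetical: vmf_recheck_pte() is an invented name for the revalidation
step, not an existing helper:

	vma_end_read(vma);			/* 1. release the per-VMA lock */
	folio_lock(folio);			/* 2. sleep until the I/O completes */
	vma = lock_vma_under_rcu(mm, address);	/* 3. re-acquire the lock */
	if (!vma || !vmf_recheck_pte(vmf))	/* 4. can we still map the PTE? */
		goto fallback;			/*    no: restart the fault */
	/* 5. map the PTE without re-entering the whole fault path */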

The change would be much more complex, so I’d prefer to land the current
patchset first. At least this way, we avoid falling back to mmap_lock and
causing contention or priority inversion, with minimal changes.

Thanks
Barry



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27  4:22   ` Barry Song
@ 2025-11-27  4:42     ` Barry Song
  2025-11-27 19:43     ` Matthew Wilcox
  1 sibling, 0 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27  4:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, Barry Song, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Pedro Falcato,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Oven Liyang,
	Mark Rutland, Ada Couprie Diaz, Robin Murphy,
	Kristina Martšenko, Kevin Brodsky, Yeoreum Yun, Wentao Guan,
	Thorsten Blum, Steven Rostedt, Yunhui Cui, Nam Cao, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel

On Thu, Nov 27, 2025 at 12:22 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > There is no need to always fall back to mmap_lock if the per-VMA
> > > lock was released only to wait for pagecache or swapcache to
> > > become ready.
> >
> > Something I've been wondering about is removing all the "drop the MM
> > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
>
> I think the point is that page fault handlers should avoid holding the VMA
> lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> writers and readers will be stuck for a while.
>
> >
> >  include/linux/pagemap.h |  8 +---
> >  mm/filemap.c            | 98 ++++++++++++-------------------------------------
> >  mm/internal.h           | 21 -----------
> >  mm/memory.c             | 13 +------
> >  mm/shmem.c              |  6 ---
> >  5 files changed, 27 insertions(+), 119 deletions(-)
> >
> > and I'm not sure we still need to do it with per-VMA locks.  What I
> > have here doesn't boot and I ran out of time to debug it.
>
> I agree there’s room for improvement, but merely removing the "drop the MM
> locks while waiting for I/O" code is unlikely to improve performance.
>

One idea I have is that we could conditionally remove the "drop lock and
retry page fault" step if we are reasonably sure the I/O has already
completed:

diff --git a/mm/filemap.c b/mm/filemap.c
index 57dfd2211109..151f6d38c284 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3517,7 +3517,9 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
                }
        }

-       if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
+       if (folio_test_uptodate(folio))
+               folio_lock(folio);
+       else if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
                goto out_retry;

        /* Did it get truncated? */
diff --git a/mm/memory.c b/mm/memory.c
index 7f70f0324dcf..355ed02560fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4758,7 +4758,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        }

        swapcache = folio;
-       ret |= folio_lock_or_retry(folio, vmf);
+       if (folio_test_uptodate(folio))
+               folio_lock(folio);
+       else
+               ret |= folio_lock_or_retry(folio, vmf);
        if (ret & VM_FAULT_RETRY) {
                if (fault_flag_allow_retry_first(vmf->flags) &&
                    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT) &&

In that case, we are likely just waiting for the mapping to be completed by
another process. I may develop the above idea as an incremental patch after
this patchset.

Thanks
Barry



* Re: [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2025-11-27  1:14 ` [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song
@ 2025-11-27 10:52   ` Pedro Falcato
  2025-11-27 11:39     ` Barry Song
  0 siblings, 1 reply; 15+ messages in thread
From: Pedro Falcato @ 2025-11-27 10:52 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, Oven Liyang, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Matthew Wilcox,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Mark Rutland,
	Ada Couprie Diaz, Robin Murphy, Kristina Martšenko,
	Kevin Brodsky, Yeoreum Yun, Wentao Guan, Thorsten Blum,
	Steven Rostedt, Yunhui Cui, Nam Cao, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song

On Thu, Nov 27, 2025 at 09:14:37AM +0800, Barry Song wrote:
> From: Oven Liyang <liyangouwen1@oppo.com>
> 
> If the current page fault is using the per-VMA lock, and we only released
> the lock to wait for I/O completion (e.g., using folio_lock()), then when
> the fault is retried after the I/O completes, it should still qualify for
> the per-VMA-lock path.
> 
<snip>
> Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm/mm/fault.c       | 5 +++++
>  arch/arm64/mm/fault.c     | 5 +++++
>  arch/loongarch/mm/fault.c | 4 ++++
>  arch/powerpc/mm/fault.c   | 5 ++++-
>  arch/riscv/mm/fault.c     | 4 ++++
>  arch/s390/mm/fault.c      | 4 ++++
>  arch/x86/mm/fault.c       | 4 ++++

If only we could unify all these paths :(

>  include/linux/mm_types.h  | 9 +++++----
>  mm/filemap.c              | 5 ++++-
>  9 files changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b71625378ce3..12b2d65ef1b9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1670,10 +1670,11 @@ enum vm_fault_reason {
>  	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
>  	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
>  	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
> -	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
> -	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
> -	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> -	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> +	VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,

So, what I am wondering here is why we need one more fault flag versus
just blindly doing this on a plain-old RETRY. Is there any particular
reason why? I can't think of one. 

I would also like to see performance numbers.

The rest of the patch looks OK to me.

-- 
Pedro



* Re: [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2025-11-27 10:52   ` Pedro Falcato
@ 2025-11-27 11:39     ` Barry Song
  2025-11-27 16:26       ` Pedro Falcato
  0 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2025-11-27 11:39 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, linux-mm, Oven Liyang, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Matthew Wilcox,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Mark Rutland,
	Ada Couprie Diaz, Robin Murphy, Kristina Martšenko,
	Kevin Brodsky, Yeoreum Yun, Wentao Guan, Thorsten Blum,
	Steven Rostedt, Yunhui Cui, Nam Cao, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song

On Thu, Nov 27, 2025 at 6:52 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Thu, Nov 27, 2025 at 09:14:37AM +0800, Barry Song wrote:
> > From: Oven Liyang <liyangouwen1@oppo.com>
> >
> > If the current page fault is using the per-VMA lock, and we only released
> > the lock to wait for I/O completion (e.g., using folio_lock()), then when
> > the fault is retried after the I/O completes, it should still qualify for
> > the per-VMA-lock path.
> >
> <snip>
> > Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm/mm/fault.c       | 5 +++++
> >  arch/arm64/mm/fault.c     | 5 +++++
> >  arch/loongarch/mm/fault.c | 4 ++++
> >  arch/powerpc/mm/fault.c   | 5 ++++-
> >  arch/riscv/mm/fault.c     | 4 ++++
> >  arch/s390/mm/fault.c      | 4 ++++
> >  arch/x86/mm/fault.c       | 4 ++++
>
> If only we could unify all these paths :(

Right, it’s a pain, but we do have bots for that?
And it’s basically just copy-and-paste across different architectures.

>
> >  include/linux/mm_types.h  | 9 +++++----
> >  mm/filemap.c              | 5 ++++-
> >  9 files changed, 39 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index b71625378ce3..12b2d65ef1b9 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1670,10 +1670,11 @@ enum vm_fault_reason {
> >       VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
> >       VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
> >       VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
> > -     VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
> > -     VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
> > -     VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> > -     VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> > +     VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,
>
> So, what I am wondering here is why we need one more fault flag versus
> just blindly doing this on a plain-old RETRY. Is there any particular
> reason why? I can't think of one.

Because in some cases we retry simply due to needing to take mmap_lock.
For example:

/**
 * __vmf_anon_prepare - Prepare to handle an anonymous fault.
 * @vmf: The vm_fault descriptor passed from the fault handler.
 *
 * When preparing to insert an anonymous page into a VMA from a
 * fault handler, call this function rather than anon_vma_prepare().
 * If this vma does not already have an associated anon_vma and we are
 * only protected by the per-VMA lock, the caller must retry with the
 * mmap_lock held.  __anon_vma_prepare() will look at adjacent VMAs to
 * determine if this VMA can share its anon_vma, and that's not safe to
 * do with only the per-VMA lock held for this VMA.
 *
 * Return: 0 if fault handling can proceed.  Any other value should be
 * returned to the caller.
 */
vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
{
...
}

Thus, we have to check each branch one by one, but I/O wait is the most
frequent path, so we handle it first.
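
To make that concrete, here is a hypothetical sketch (not part of the
patch) of what blindly retrying the per-VMA path on the generic flag
would do in the arch handler:

	/* Hypothetical: retry the per-VMA path on plain VM_FAULT_RETRY */
	if (fault & VM_FAULT_RETRY)
		goto retry_vma;	/* wrong: __vmf_anon_prepare() returns
				 * VM_FAULT_RETRY precisely because it
				 * needs mmap_lock, so this would spin
				 * on the per-VMA path forever */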

>
> I would also like to see performance numbers.

Yes. From what I understand, this patchset should improve performance in a
fairly straightforward way.
But yes, I can certainly include some data in v2.

>
> The rest of the patch looks OK to me.

Thanks
Barry



* Re: [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O
  2025-11-27 11:39     ` Barry Song
@ 2025-11-27 16:26       ` Pedro Falcato
  0 siblings, 0 replies; 15+ messages in thread
From: Pedro Falcato @ 2025-11-27 16:26 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, Oven Liyang, Russell King, Catalin Marinas,
	Will Deacon, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Matthew Wilcox,
	Jarkko Sakkinen, Oscar Salvador, Kuninori Morimoto, Mark Rutland,
	Ada Couprie Diaz, Robin Murphy, Kristina Martšenko,
	Kevin Brodsky, Yeoreum Yun, Wentao Guan, Thorsten Blum,
	Steven Rostedt, Yunhui Cui, Nam Cao, linux-arm-kernel,
	linux-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
	linux-fsdevel, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song

On Thu, Nov 27, 2025 at 07:39:11PM +0800, Barry Song wrote:
> On Thu, Nov 27, 2025 at 6:52 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > On Thu, Nov 27, 2025 at 09:14:37AM +0800, Barry Song wrote:
> > > From: Oven Liyang <liyangouwen1@oppo.com>
> > >
> > > If the current page fault is using the per-VMA lock, and we only released
> > > the lock to wait for I/O completion (e.g., using folio_lock()), then when
> > > the fault is retried after the I/O completes, it should still qualify for
> > > the per-VMA-lock path.
> > >
> > <snip>
> > > Signed-off-by: Oven Liyang <liyangouwen1@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  arch/arm/mm/fault.c       | 5 +++++
> > >  arch/arm64/mm/fault.c     | 5 +++++
> > >  arch/loongarch/mm/fault.c | 4 ++++
> > >  arch/powerpc/mm/fault.c   | 5 ++++-
> > >  arch/riscv/mm/fault.c     | 4 ++++
> > >  arch/s390/mm/fault.c      | 4 ++++
> > >  arch/x86/mm/fault.c       | 4 ++++
> >
> > If only we could unify all these paths :(
> 
> Right, it’s a pain, but we do have bots for that?
> And it’s basically just copy-and-paste across different architectures.
> 
> >
> > >  include/linux/mm_types.h  | 9 +++++----
> > >  mm/filemap.c              | 5 ++++-
> > >  9 files changed, 39 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index b71625378ce3..12b2d65ef1b9 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1670,10 +1670,11 @@ enum vm_fault_reason {
> > >       VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
> > >       VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
> > >       VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
> > > -     VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
> > > -     VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
> > > -     VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
> > > -     VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> > > +     VM_FAULT_RETRY_VMA      = (__force vm_fault_t)0x000800,
> >
> > So, what I am wondering here is why we need one more fault flag versus
> > just blindly doing this on a plain-old RETRY. Is there any particular
> > reason why? I can't think of one.
> 
> Because in some cases we retry simply due to needing to take mmap_lock.
> For example:
> 
> /**
>  * __vmf_anon_prepare - Prepare to handle an anonymous fault.
>  * @vmf: The vm_fault descriptor passed from the fault handler.
>  *
>  * When preparing to insert an anonymous page into a VMA from a
>  * fault handler, call this function rather than anon_vma_prepare().
>  * If this vma does not already have an associated anon_vma and we are
>  * only protected by the per-VMA lock, the caller must retry with the
>  * mmap_lock held.  __anon_vma_prepare() will look at adjacent VMAs to
>  * determine if this VMA can share its anon_vma, and that's not safe to
>  * do with only the per-VMA lock held for this VMA.
>  *
>  * Return: 0 if fault handling can proceed.  Any other value should be
>  * returned to the caller.
>  */
> vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
> {
> ...
> }
> 
> Thus, we have to check each branch one by one, but I/O wait is the most
> frequent path, so we handle it first.
>

Hmm, right, good point. I think this is the safest option then.

FWIW:
Acked-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27  4:22   ` Barry Song
  2025-11-27  4:42     ` Barry Song
@ 2025-11-27 19:43     ` Matthew Wilcox
  2025-11-27 20:29       ` Barry Song
  2025-11-30  5:38       ` Shakeel Butt
  1 sibling, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2025-11-27 19:43 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

[dropping individuals, leaving only mailing lists.  please don't send
this kind of thing to so many people in future]

On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > There is no need to always fall back to mmap_lock if the per-VMA
> > > lock was released only to wait for pagecache or swapcache to
> > > become ready.
> >
> > Something I've been wondering about is removing all the "drop the MM
> > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> 
> I think the point is that page fault handlers should avoid holding the VMA
> lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> writers and readers will be stuck for a while.

There's a usecase some of us have been discussing off-list for a few
weeks that our current strategy pessimises.  It's a process with
thousands (maybe tens of thousands) of threads.  It has many more mapped
files than it has memory that cgroups will allow it to use.  So on a
page fault, we drop the vma lock, allocate a page of ram, kick off the
read, sleep waiting for the folio to come uptodate, and once it is, return,
expecting the page to still be there when we reenter filemap_fault.
But it's under so much memory pressure that it's already been reclaimed
by the time we get back to it.  So all the threads just batter the
storage re-reading data.

If we don't drop the vma lock, we can insert the pages in the page table
and return, maybe getting some work done before this thread is
descheduled.

This use case also manages to get utterly hung-up trying to do reclaim
today with the mmap_lock held.  So it manifests somewhat similarly to
your problem (everybody ends up blocked on mmap_lock) but it has a
rather different root cause.

> I agree there’s room for improvement, but merely removing the "drop the MM
> locks while waiting for I/O" code is unlikely to improve performance.

I'm not sure it'd hurt performance.  The "drop mmap locks for I/O" code
was written before the VMA locking code was written.  I don't know that
it's actually helping these days.

> The change would be much more complex, so I’d prefer to land the current
> patchset first. At least this way, we avoid falling back to mmap_lock and
> causing contention or priority inversion, with minimal changes.

Uh, this is an RFC patchset.  I'm giving you my comment, which is that I
don't think this is the right direction to go in.  Any talk of "landing"
these patches is extremely premature.



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 19:43     ` Matthew Wilcox
@ 2025-11-27 20:29       ` Barry Song
  2025-11-27 21:52         ` Barry Song
  2025-11-30  0:28         ` Suren Baghdasaryan
  2025-11-30  5:38       ` Shakeel Butt
  1 sibling, 2 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27 20:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> [dropping individuals, leaving only mailing lists.  please don't send
> this kind of thing to so many people in future]
>
> On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > lock was released only to wait for pagecache or swapcache to
> > > > become ready.
> > >
> > > Something I've been wondering about is removing all the "drop the MM
> > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> >
> > I think the point is that page fault handlers should avoid holding the VMA
> > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > writers and readers will be stuck for a while.
>
> There's a usecase some of us have been discussing off-list for a few
> weeks that our current strategy pessimises.  It's a process with
> thousands (maybe tens of thousands) of threads.  It has many more mapped
> files than it has memory that cgroups will allow it to use.  So on a
> page fault, we drop the vma lock, allocate a page of ram, kick off the
> read, sleep waiting for the folio to come uptodate, and once it is, return,
> expecting the page to still be there when we reenter filemap_fault.
> But it's under so much memory pressure that it's already been reclaimed
> by the time we get back to it.  So all the threads just batter the
> storage re-reading data.

Is this entirely the fault of re-entering the page fault? Under extreme
memory pressure, even if we map the pages, they can still be reclaimed
quickly?

>
> If we don't drop the vma lock, we can insert the pages in the page table
> and return, maybe getting some work done before this thread is
> descheduled.

If we need to protect the page from being reclaimed too early, the fix
should reside within LRU management, not in page fault handling.

Also, I gave an example where we may not drop the VMA lock if the folio is
already up to date. That likely corresponds to waiting for the PTE mapping to
complete.

>
> This use case also manages to get utterly hung-up trying to do reclaim
> today with the mmap_lock held.  So it manifests somewhat similarly to
> your problem (everybody ends up blocked on mmap_lock) but it has a
> rather different root cause.
>
> > I agree there’s room for improvement, but merely removing the "drop the MM
> > locks while waiting for I/O" code is unlikely to improve performance.
>
> I'm not sure it'd hurt performance.  The "drop mmap locks for I/O" code
> was written before the VMA locking code was written.  I don't know that
> it's actually helping these days.

I am concerned that other write paths may still need to modify the VMA, for
example during splitting. Tail latency has long been a significant issue for
Android users, and we have observed it even with folio_lock, which has much
finer granularity than the VMA lock.

>
> > The change would be much more complex, so I’d prefer to land the current
> > patchset first. At least this way, we avoid falling back to mmap_lock and
> > causing contention or priority inversion, with minimal changes.
>
> Uh, this is an RFC patchset.  I'm giving you my comment, which is that I
> don't think this is the right direction to go in.  Any talk of "landing"
> these patches is extremely premature.

While I agree that there are other approaches worth exploring, I
remain entirely unconvinced that this patchset is the wrong
direction. With the current retry logic, it substantially reduces
mmap_lock acquisitions and is clear low-hanging fruit.

Also, I am not referring to landing the RFC itself, but to a subsequent formal
patchset that retries using the per-VMA lock.

Thanks
Barry



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 20:29       ` Barry Song
@ 2025-11-27 21:52         ` Barry Song
  2025-11-30  0:28         ` Suren Baghdasaryan
  1 sibling, 0 replies; 15+ messages in thread
From: Barry Song @ 2025-11-27 21:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-arm-kernel, linux-kernel, loongarch,
	linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Fri, Nov 28, 2025 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > [dropping individuals, leaving only mailing lists.  please don't send
> > this kind of thing to so many people in future]

Apologies, I missed this one.

The output comes from ./scripts/get_maintainer.pl. If you think the group is
too large, I guess we should at least include Suren, Lorenzo, David, and
a few others in the discussion?

[...]

>
> >
> > This use case also manages to get utterly hung-up trying to do reclaim
> > today with the mmap_lock held.  So it manifests somewhat similarly to
> > your problem (everybody ends up blocked on mmap_lock) but it has a
> > rather different root cause.

If I understand the use case correctly, I believe retrying with the per-VMA
lock would also be very helpful. Previously, we always retried using
mmap_lock, which can be difficult to acquire under heavy contention, leading
to long latency during which the pages may be reclaimed. The per-VMA lock is
much easier to acquire, letting the fault proceed with its work.

Thanks
Barry



* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 20:29       ` Barry Song
  2025-11-27 21:52         ` Barry Song
@ 2025-11-30  0:28         ` Suren Baghdasaryan
  2025-11-30  2:56           ` Barry Song
  1 sibling, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2025-11-30  0:28 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, akpm, linux-mm, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Thu, Nov 27, 2025 at 2:29 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > [dropping individuals, leaving only mailing lists.  please don't send
> > this kind of thing to so many people in future]
> >
> > On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > > lock was released only to wait for pagecache or swapcache to
> > > > > become ready.
> > > >
> > > > Something I've been wondering about is removing all the "drop the MM
> > > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> > >
> > > I think the point is that page fault handlers should avoid holding the VMA
> > > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > > writers and readers will be stuck for a while.
> >
> > There's a usecase some of us have been discussing off-list for a few
> > weeks that our current strategy pessimises.  It's a process with
> > thousands (maybe tens of thousands) of threads.  It has many more mapped
> > files than it has memory that cgroups will allow it to use.  So on a
> > page fault, we drop the vma lock, allocate a page of ram, kick off the
> > read, sleep waiting for the folio to come uptodate, and once it is, return,
> > expecting the page to still be there when we reenter filemap_fault.
> > But it's under so much memory pressure that it's already been reclaimed
> > by the time we get back to it.  So all the threads just batter the
> > storage re-reading data.
>
> Is this entirely the fault of re-entering the page fault? Under extreme
> memory pressure, even if we map the pages, they can still be reclaimed
> quickly?
>
> >
> > If we don't drop the vma lock, we can insert the pages in the page table
> > and return, maybe getting some work done before this thread is
> > descheduled.
>
> If we need to protect the page from being reclaimed too early, the fix
> should reside within LRU management, not in page fault handling.
>
> Also, I gave an example where we may avoid dropping the VMA lock if the
> folio is already up to date; in that case, the wait is likely only for the
> PTE mapping to complete.
>
> >
> > This use case also manages to get utterly hung up trying to do reclaim
> > today with the mmap_lock held.  So it manifests somewhat similarly to
> > your problem (everybody ends up blocked on mmap_lock) but it has a
> > rather different root cause.
> >
> > > I agree there’s room for improvement, but merely removing the "drop the MM
> > > locks while waiting for I/O" code is unlikely to improve performance.
> >
> > I'm not sure it'd hurt performance.  The "drop mmap locks for I/O" code
> > was written before the VMA locking code existed.  I don't know that
> > it's actually helping these days.
>
> I am concerned that other write paths may still need to modify the VMA, for
> example during splitting. Tail latency has long been a significant issue for
> Android users, and we have observed it even with folio_lock, which has much
> finer granularity than the VMA lock.

Another corner case we need to consider is a large VMA covering most
of the address space: holding its VMA lock during I/O would resemble
holding the mmap_lock, leading to the same issue we faced before "drop
mmap locks for I/O". We discussed this with Matthew in the context of
the problem he mentioned (the page being reclaimed before the page
fault retry happens), but reached no conclusion yet.

>
> >
> > > The change would be much more complex, so I’d prefer to land the current
> > > patchset first. At least this way, we avoid falling back to mmap_lock and
> > > causing contention or priority inversion, with minimal changes.
> >
> > Uh, this is an RFC patchset.  I'm giving you my comment, which is that I
> > don't think this is the right direction to go in.  Any talk of "landing"
> > these patches is extremely premature.
>
> While I agree that there are other approaches worth exploring, I
> remain entirely unconvinced that this patchset is the wrong
> direction. With the current retry logic, it substantially reduces
> mmap_lock acquisitions and is clear low-hanging fruit.
>
> Also, I am not referring to landing the RFC itself, but to a subsequent formal
> patchset that retries using the per-VMA lock.

I don't know if this direction is the right one, but I agree with
Matthew that we should consider alternatives before adopting a new
direction. Hopefully we can find one fix for both problems rather than
fixing each one in isolation.

>
> Thanks
> Barry
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-30  0:28         ` Suren Baghdasaryan
@ 2025-11-30  2:56           ` Barry Song
  0 siblings, 0 replies; 15+ messages in thread
From: Barry Song @ 2025-11-30  2:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, akpm, linux-mm, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Sun, Nov 30, 2025 at 8:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Nov 27, 2025 at 2:29 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Nov 28, 2025 at 3:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > [dropping individuals, leaving only mailing lists.  please don't send
> > > this kind of thing to so many people in future]
> > >
> > > On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > > > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > > > lock was released only to wait for pagecache or swapcache to
> > > > > > become ready.
> > > > >
> > > > > Something I've been wondering about is removing all the "drop the MM
> > > > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> > > >
> > > > I think the point is that page fault handlers should avoid holding the VMA
> > > > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > > > writers and readers will be stuck for a while.
> > >
> > > There's a usecase some of us have been discussing off-list for a few
> > > weeks that our current strategy pessimises.  It's a process with
> > > thousands (maybe tens of thousands) of threads.  It has many more mapped
> > > files than it has memory that cgroups will allow it to use.  So on a
> > > page fault, we drop the vma lock, allocate a page of RAM, kick off the
> > > read, and sleep waiting for the folio to come uptodate; once it is, we
> > > return, expecting the page to still be there when we reenter filemap_fault.
> > > But it's under so much memory pressure that it's already been reclaimed
> > > by the time we get back to it.  So all the threads just batter the
> > > storage re-reading data.
> >
> > Is this entirely caused by re-entering the page fault? Under extreme
> > memory pressure, even if we map the pages, can't they still be reclaimed
> > quickly?
> >
> > >
> > > If we don't drop the vma lock, we can insert the pages in the page table
> > > and return, maybe getting some work done before this thread is
> > > descheduled.
> >
> > If we need to protect the page from being reclaimed too early, the fix
> > should reside within LRU management, not in page fault handling.
> >
> > Also, I gave an example where we may avoid dropping the VMA lock if the
> > folio is already up to date; in that case, the wait is likely only for the
> > PTE mapping to complete.
> >
> > >
> > > This use case also manages to get utterly hung up trying to do reclaim
> > > today with the mmap_lock held.  So it manifests somewhat similarly to
> > > your problem (everybody ends up blocked on mmap_lock) but it has a
> > > rather different root cause.
> > >
> > > > I agree there’s room for improvement, but merely removing the "drop the MM
> > > > locks while waiting for I/O" code is unlikely to improve performance.
> > >
> > > I'm not sure it'd hurt performance.  The "drop mmap locks for I/O" code
> > > was written before the VMA locking code existed.  I don't know that
> > > it's actually helping these days.
> >
> > I am concerned that other write paths may still need to modify the VMA, for
> > example during splitting. Tail latency has long been a significant issue for
> > Android users, and we have observed it even with folio_lock, which has much
> > finer granularity than the VMA lock.
>
> Another corner case we need to consider is a large VMA covering most
> of the address space: holding its VMA lock during I/O would resemble
> holding the mmap_lock, leading to the same issue we faced before "drop
> mmap locks for I/O". We discussed this with Matthew in the context of
> the problem he mentioned (the page being reclaimed before the page
> fault retry happens), but reached no conclusion yet.

Suren, thank you very much for your input.

Right. I think we may discover more corner cases on Android in places
where we previously saw VMA merging, such as between two native heap
mmap areas. This can happen fairly often, and we don’t want long BIO
queues to block those writers.

>
> >
> > >
> > > > The change would be much more complex, so I’d prefer to land the current
> > > > patchset first. At least this way, we avoid falling back to mmap_lock and
> > > > causing contention or priority inversion, with minimal changes.
> > >
> > > Uh, this is an RFC patchset.  I'm giving you my comment, which is that I
> > > don't think this is the right direction to go in.  Any talk of "landing"
> > > these patches is extremely premature.
> >
> > While I agree that there are other approaches worth exploring, I
> > remain entirely unconvinced that this patchset is the wrong
> > direction. With the current retry logic, it substantially reduces
> > mmap_lock acquisitions and is clear low-hanging fruit.
> >
> > Also, I am not referring to landing the RFC itself, but to a subsequent formal
> > patchset that retries using the per-VMA lock.
>
> I don't know if this direction is the right one, but I agree with
> Matthew that we should consider alternatives before adopting a new
> direction. Hopefully we can find one fix for both problems rather than
> fixing each one in isolation.

As I mentioned in a follow-up reply to Matthew[1], I think the current
approach also helps when pages are reclaimed during retries.
Previously, a retry required mmap_lock, so any contention made it hard
to acquire and introduced high latency; during that window, pages
could be reclaimed before mmap_lock was obtained. Now that only the
per-VMA lock is required, retries can proceed much more readily than
before. As long as we replace a big lock with a smaller one, there is
less chance of getting stuck in D state.
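
To illustrate (a rough sketch under the same assumption, not the
actual patch): on the mm/filemap.c side, once filemap_fault() has
pinned the file and dropped the lock purely to wait for the read, it
could tag the retry with the hypothetical VM_FAULT_VMA_LOCK_RETRY bit
from the earlier sketch, so the arch handler knows the per-VMA lock
suffices:

	/* in filemap_fault(), after kicking off the read */
	if (fpin) {
		folio_put(folio);
		/* hypothetical bit: lock dropped only to wait for I/O */
		return VM_FAULT_RETRY | VM_FAULT_VMA_LOCK_RETRY;
	}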

If either you or Matthew has a reproducer for this issue, I’d be
happy to try it out.

BTW, we also observed mmap_lock contention during MGLRU aging. TBH,
non-RMAP clearing of the PTE young bit does not seem helpful on arm64,
which does not support non-leaf young bits at all. After disabling the
0x0002 feature described below, we found that reclamation used less
CPU and performed better.

# keep only the main switch (0x0001); this clears 0x0002 and 0x0004
echo 1 >/sys/kernel/mm/lru_gen/enabled

0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wvaieWtTrK+koM3SFu9rDExkVHX5eUwYiEotVqP-ndEQ@mail.gmail.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O
  2025-11-27 19:43     ` Matthew Wilcox
  2025-11-27 20:29       ` Barry Song
@ 2025-11-30  5:38       ` Shakeel Butt
  1 sibling, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-11-30  5:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song, akpm, linux-mm, linux-arm-kernel, linux-kernel,
	loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-fsdevel

On Thu, Nov 27, 2025 at 07:43:22PM +0000, Matthew Wilcox wrote:
> [dropping individuals, leaving only mailing lists.  please don't send
> this kind of thing to so many people in future]
> 
> On Thu, Nov 27, 2025 at 12:22:16PM +0800, Barry Song wrote:
> > On Thu, Nov 27, 2025 at 12:09 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Nov 27, 2025 at 09:14:36AM +0800, Barry Song wrote:
> > > > There is no need to always fall back to mmap_lock if the per-VMA
> > > > lock was released only to wait for pagecache or swapcache to
> > > > become ready.
> > >
> > > Something I've been wondering about is removing all the "drop the MM
> > > locks while we wait for I/O" gunk.  It's a nice amount of code removed:
> > 
> > I think the point is that page fault handlers should avoid holding the VMA
> > lock or mmap_lock for too long while waiting for I/O. Otherwise, those
> > writers and readers will be stuck for a while.
> 
> There's a usecase some of us have been discussing off-list for a few
> weeks that our current strategy pessimises.  It's a process with
> thousands (maybe tens of thousands) of threads.  It has many more mapped
> files than it has memory that cgroups will allow it to use.  So on a
> page fault, we drop the vma lock, allocate a page of RAM, kick off the
> read, and sleep waiting for the folio to come uptodate; once it is, we
> return, expecting the page to still be there when we reenter filemap_fault.
> But it's under so much memory pressure that it's already been reclaimed
> by the time we get back to it.  So all the threads just batter the
> storage re-reading data.

I would caution against changing the kernel for such a usecase. Actually,
I would call it a misconfigured system rather than a usecase. If a
workload is under so much memory pressure that its refaulted pages are
getting reclaimed, then its workingset is larger than the available
memory and it is thrashing. The only options here are to either
increase the memory limits or to kill the workload and reschedule it
on a system with enough memory available.



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-11-30  5:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-27  1:14 [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Barry Song
2025-11-27  1:14 ` [RFC PATCH 1/2] mm/filemap: Retry fault by VMA lock if the lock was released for I/O Barry Song
2025-11-27 10:52   ` Pedro Falcato
2025-11-27 11:39     ` Barry Song
2025-11-27 16:26       ` Pedro Falcato
2025-11-27  1:14 ` [RFC PATCH 2/2] mm/swapin: Retry swapin " Barry Song
2025-11-27  4:09 ` [RFC PATCH 0/2] mm: continue using per-VMA lock when retrying page faults after I/O Matthew Wilcox
2025-11-27  4:22   ` Barry Song
2025-11-27  4:42     ` Barry Song
2025-11-27 19:43     ` Matthew Wilcox
2025-11-27 20:29       ` Barry Song
2025-11-27 21:52         ` Barry Song
2025-11-30  0:28         ` Suren Baghdasaryan
2025-11-30  2:56           ` Barry Song
2025-11-30  5:38       ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox