* [PATCH v4 0/6] Per-VMA lock support for swap and userfaults
@ 2023-06-28 7:17 Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
` (5 more replies)
0 siblings, 6 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
When per-VMA locks were introduced in [1], several types of page faults
would still fall back to mmap_lock to keep the patchset simple. Among them
are swap and userfault page faults. The main reason for skipping those
cases was that mmap_lock could be dropped while handling these faults,
which required additional logic to be implemented.
Implement the mechanism to allow per-VMA locks to be dropped for these
cases.
First, change handle_mm_fault to drop per-VMA locks when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED, to be consistent with the way
mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
and return vm_fault_t, which simplifies later patches. Finally, allow swap
and uffd page faults to be handled under per-VMA locks by dropping the
per-VMA lock and retrying, the same way it's done under mmap_lock.
Naturally, once the VMA lock is dropped that VMA should be assumed unstable
and must not be used.
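For reference, the caller-side pattern after this series looks roughly like
the following. This is a simplified sketch based on the arch fault handlers
touched in patch 3/6, not a literal copy of any of them:

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	/* The fault path drops the VMA lock itself on RETRY/COMPLETED */
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		vma_end_read(vma);

	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);

	/* The VMA is unstable here and must not be touched again */
lock_mmap:
	/* regular mmap_lock-protected fault handling */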
Changes since v3 posted at [2]:
- Renamed folio_lock_fault back to folio_lock_or_retry, per Peter Xu
- Moved per-VMA lock release to where VM_FAULT_RETRY is returned,
per Peter Xu
- Dropped FAULT_FLAG_LOCK_DROPPED usage, per Peter Xu
- Introduced release_fault_lock() helper function, per Peter Xu
- Dropped the patch releasing per-VMA lock before migration_entry_wait,
per Peter Xu
- Introduced assert_fault_locked() helper function, per Peter Xu
- Added BUG_ON to prevent FAULT_FLAG_RETRY_NOWAIT usage with per-VMA locks
Note: patch 3/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
when applied over the mm-unstable branch due to a patch from the arm64
tree [3] which is missing in mm-unstable.
[1] https://lore.kernel.org/all/20230227173632.3292573-1-surenb@google.com/
[2] https://lore.kernel.org/all/20230627042321.1763765-1-surenb@google.com/
[3] https://lore.kernel.org/all/20230524131305.2808-1-jszhang@kernel.org/
Suren Baghdasaryan (6):
swap: remove remnants of polling from read_swap_cache_async
mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
mm: drop per-VMA lock when returning VM_FAULT_RETRY or
VM_FAULT_COMPLETED
mm: change folio_lock_or_retry to use vm_fault directly
mm: handle swap page faults under per-VMA lock
mm: handle userfaults under VMA lock
arch/arm64/mm/fault.c | 3 ++-
arch/powerpc/mm/fault.c | 3 ++-
arch/s390/mm/fault.c | 3 ++-
arch/x86/mm/fault.c | 3 ++-
fs/userfaultfd.c | 39 ++++++++++++++++++---------------------
include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 3 ++-
include/linux/pagemap.h | 9 ++++-----
mm/filemap.c | 37 +++++++++++++++++++------------------
mm/madvise.c | 4 ++--
mm/memory.c | 38 ++++++++++++++++----------------------
mm/swap.h | 1 -
mm/swap_state.c | 12 +++++-------
13 files changed, 113 insertions(+), 81 deletions(-)
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 1/6] swap: remove remnants of polling from read_swap_cache_async
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
@ 2023-06-28 7:17 ` Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
` (4 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team,
Christoph Hellwig
Commit [1] introduced IO polling support during swapin to reduce swap read
latency for block devices that can be polled. However, the later commit [2]
removed polling support. Therefore it seems safe to remove the do_poll
parameter from read_swap_cache_async and always call swap_readpage with
synchronous=false, waiting for IO completion in folio_lock_or_retry.
[1] commit 23955622ff8d ("swap: add block io poll in swapin path")
[2] commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
mm/madvise.c | 4 ++--
mm/swap.h | 1 -
mm/swap_state.c | 12 +++++-------
3 files changed, 7 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..b1e8adf1234e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -215,7 +215,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false, &splug);
+ vma, index, &splug);
if (page)
put_page(page);
}
@@ -252,7 +252,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
rcu_read_unlock();
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false, &splug);
+ NULL, 0, &splug);
if (page)
put_page(page);
diff --git a/mm/swap.h b/mm/swap.h
index 7c033d793f15..8a3c7a0ace4f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -46,7 +46,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
unsigned long addr,
- bool do_poll,
struct swap_iocb **plug);
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..a3839de71f3f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -517,15 +517,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
- unsigned long addr, bool do_poll,
- struct swap_iocb **plug)
+ unsigned long addr, struct swap_iocb **plug)
{
bool page_was_allocated;
struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
vma, addr, &page_was_allocated);
if (page_was_allocated)
- swap_readpage(retpage, do_poll, plug);
+ swap_readpage(retpage, false, plug);
return retpage;
}
@@ -620,7 +619,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
- bool do_poll = true, page_allocated;
+ bool page_allocated;
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
@@ -628,7 +627,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!mask)
goto skip;
- do_poll = false;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -660,7 +658,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll, NULL);
+ return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
}
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -825,7 +823,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
- ra_info.win == 1, NULL);
+ NULL);
}
/**
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
@ 2023-06-28 7:17 ` Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason
to be used as the flag_array inside trace_print_flags_seq(). The element
for VM_FAULT_COMPLETED is missing; add it.
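For illustration, this is roughly how such a flag array is consumed by a
trace event. The event below is hypothetical and only shows why the missing
entry matters: without it, a VM_FAULT_COMPLETED result would be printed as
a raw hex value instead of a symbolic name.

TRACE_EVENT(sample_fault_result,
	TP_PROTO(vm_fault_t result),
	TP_ARGS(result),
	TP_STRUCT__entry(
		__field(unsigned int, result)
	),
	TP_fast_assign(
		__entry->result = result;
	),
	TP_printk("result=%s",
		  __print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE))
);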
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
include/linux/mm_types.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..79765e3dd8f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,7 +1070,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
- { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_COMPLETED, "COMPLETED" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
@ 2023-06-28 7:17 ` Suren Baghdasaryan
2023-06-28 13:41 ` Peter Xu
2023-06-28 7:17 ` [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released. However, with per-VMA locks the behavior is
different: the caller should still release the lock in that case. To make
the rules consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page, and no path returns
VM_FAULT_COMPLETED for now.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
arch/arm64/mm/fault.c | 3 ++-
arch/powerpc/mm/fault.c | 3 ++-
arch/s390/mm/fault.c | 3 ++-
arch/x86/mm/fault.c | 3 ++-
mm/memory.c | 1 +
5 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c85b6d70b222..9c06c53a9ff3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -612,7 +612,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
goto lock_mmap;
}
fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 531177a4ee08..4697c5dca31c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -494,7 +494,8 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index b65144c392b0..cccefe41038b 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -418,7 +418,8 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4399983c50c..d69c85c1c04e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1347,7 +1347,8 @@ void do_user_addr_fault(struct pt_regs *regs,
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..f14d45957b83 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3713,6 +3713,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
ret = VM_FAULT_RETRY;
+ vma_end_read(vma);
goto out;
}
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
` (2 preceding siblings ...)
2023-06-28 7:17 ` [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
@ 2023-06-28 7:17 ` Suren Baghdasaryan
2023-06-28 13:42 ` Peter Xu
2023-06-28 7:17 ` [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
2023-06-28 7:18 ` [PATCH v4 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
5 siblings, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
Change folio_lock_or_retry to accept a vm_fault struct and return
vm_fault_t directly.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/pagemap.h | 9 ++++-----
mm/filemap.c | 22 ++++++++++++----------
mm/memory.c | 14 ++++++--------
3 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..59d070c55c97 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -896,8 +896,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags);
+vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);
@@ -1001,11 +1000,11 @@ static inline int folio_lock_killable(struct folio *folio)
* Return value and mmap_lock implications depend on flags; see
* __folio_lock_or_retry().
*/
-static inline bool folio_lock_or_retry(struct folio *folio,
- struct mm_struct *mm, unsigned int flags)
+static inline vm_fault_t folio_lock_or_retry(struct folio *folio,
+ struct vm_fault *vmf)
{
might_sleep();
- return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+ return folio_trylock(folio) ? 0 : __folio_lock_or_retry(folio, vmf);
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 00f01d8ead47..52bcf12dcdbf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1701,32 +1701,34 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
/*
* Return values:
- * true - folio is locked; mmap_lock is still held.
- * false - folio is not locked.
+ * 0 - folio is locked.
+ * VM_FAULT_RETRY - folio is not locked.
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
* which case mmap_lock is still held.
*
- * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
+ * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
* with the folio locked and the mmap_lock unperturbed.
*/
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags)
+vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+ unsigned int flags = vmf->flags;
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
- * even though return 0.
+ * even though return VM_FAULT_RETRY.
*/
if (flags & FAULT_FLAG_RETRY_NOWAIT)
- return false;
+ return VM_FAULT_RETRY;
mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
folio_wait_locked_killable(folio);
else
folio_wait_locked(folio);
- return false;
+ return VM_FAULT_RETRY;
}
if (flags & FAULT_FLAG_KILLABLE) {
bool ret;
@@ -1734,13 +1736,13 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
ret = __folio_lock_killable(folio);
if (ret) {
mmap_read_unlock(mm);
- return false;
+ return VM_FAULT_RETRY;
}
} else {
__folio_lock(folio);
}
- return true;
+ return 0;
}
/**
diff --git a/mm/memory.c b/mm/memory.c
index f14d45957b83..345080052003 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3568,6 +3568,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
struct folio *folio = page_folio(vmf->page);
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
+ vm_fault_t ret;
/*
* We need a reference to lock the folio because we don't hold
@@ -3580,9 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
if (!folio_try_get(folio))
return 0;
- if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+ ret = folio_lock_or_retry(folio, vmf);
+ if (ret) {
folio_put(folio);
- return VM_FAULT_RETRY;
+ return ret;
}
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3704,7 +3706,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
- int locked;
vm_fault_t ret = 0;
void *shadow = NULL;
@@ -3826,12 +3827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
}
- locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
-
- if (!locked) {
- ret |= VM_FAULT_RETRY;
+ ret |= folio_lock_or_retry(folio, vmf);
+ if (ret & VM_FAULT_RETRY)
goto out_release;
- }
if (swapcache) {
/*
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
` (3 preceding siblings ...)
2023-06-28 7:17 ` [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
@ 2023-06-28 7:17 ` Suren Baghdasaryan
2023-06-28 13:43 ` Peter Xu
2023-06-28 7:18 ` [PATCH v4 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
5 siblings, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:17 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
When a page fault is handled under per-VMA lock protection, all swap page
faults are retried with mmap_lock because folio_lock_or_retry has to drop
and reacquire mmap_lock if the folio could not be immediately locked.
Follow the same pattern as mmap_lock and drop the per-VMA lock when waiting
for the folio, retrying once the folio is available.
With this obstacle removed, enable do_swap_page to operate under per-VMA
lock protection. Drivers implementing ops->migrate_to_ram might still rely
on mmap_lock, so we have to fall back to mmap_lock in that particular case.
Note that the only time do_swap_page calls synchronous swap_readpage
is when SWP_SYNCHRONOUS_IO is set, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
pmem). Therefore we don't sleep in this path, and there's no need to
drop the mmap or per-VMA lock.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/filemap.c | 25 ++++++++++++++++---------
mm/memory.c | 16 ++++++++++------
2 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 52bcf12dcdbf..7ee078e1a0d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1699,31 +1699,38 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
return ret;
}
+static void release_fault_lock(struct vm_fault *vmf)
+{
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vmf->vma);
+ else
+ mmap_read_unlock(vmf->vma->vm_mm);
+}
+
/*
* Return values:
* 0 - folio is locked.
* VM_FAULT_RETRY - folio is not locked.
- * mmap_lock has been released (mmap_read_unlock(), unless flags had both
- * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
- * which case mmap_lock is still held.
+ * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
+ * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
+ * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
*/
vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
{
- struct mm_struct *mm = vmf->vma->vm_mm;
unsigned int flags = vmf->flags;
if (fault_flag_allow_retry_first(flags)) {
/*
- * CAUTION! In this case, mmap_lock is not released
- * even though return VM_FAULT_RETRY.
+ * CAUTION! In this case, mmap_lock/per-VMA lock is not
+ * released even though returning VM_FAULT_RETRY.
*/
if (flags & FAULT_FLAG_RETRY_NOWAIT)
return VM_FAULT_RETRY;
- mmap_read_unlock(mm);
+ release_fault_lock(vmf);
if (flags & FAULT_FLAG_KILLABLE)
folio_wait_locked_killable(folio);
else
@@ -1735,7 +1742,7 @@ vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
ret = __folio_lock_killable(folio);
if (ret) {
- mmap_read_unlock(mm);
+ release_fault_lock(vmf);
return VM_FAULT_RETRY;
}
} else {
diff --git a/mm/memory.c b/mm/memory.c
index 345080052003..76c7907e7286 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3712,12 +3712,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
- if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
- ret = VM_FAULT_RETRY;
- vma_end_read(vma);
- goto out;
- }
-
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
@@ -3727,6 +3721,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ /*
+ * migrate_to_ram is not yet ready to operate
+ * under VMA lock.
+ */
+ vma_end_read(vma);
+ ret |= VM_FAULT_RETRY;
+ goto out;
+ }
+
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
--
2.41.0.162.gfafddb0af9-goog
* [PATCH v4 6/6] mm: handle userfaults under VMA lock
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
` (4 preceding siblings ...)
2023-06-28 7:17 ` [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
@ 2023-06-28 7:18 ` Suren Baghdasaryan
2023-06-28 7:24 ` Suren Baghdasaryan
2023-06-28 13:57 ` Peter Xu
5 siblings, 2 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:18 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
surenb, linux-mm, linux-fsdevel, linux-kernel, kernel-team
Enable handle_userfault to operate under the VMA lock by releasing the VMA
lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT
should never be used when handling faults under per-VMA lock protection
because that would break the assumption that the lock is dropped on retry.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/userfaultfd.c | 39 ++++++++++++++++++---------------------
include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 8 --------
mm/memory.c | 9 ---------
4 files changed, 57 insertions(+), 38 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 4e800bb7d2ab..d019e7df6f15 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -277,17 +277,16 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
* hugepmd ranges.
*/
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long address,
- unsigned long flags,
- unsigned long reason)
+ struct vm_fault *vmf,
+ unsigned long reason)
{
+ struct vm_area_struct *vma = vmf->vma;
pte_t *ptep, pte;
bool ret = true;
- mmap_assert_locked(ctx->mm);
+ assert_fault_locked(ctx->mm, vmf);
- ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
+ ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;
@@ -308,10 +307,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
}
#else
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long address,
- unsigned long flags,
- unsigned long reason)
+ struct vm_fault *vmf,
+ unsigned long reason)
{
return false; /* should never get here */
}
@@ -325,11 +322,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
* threads.
*/
static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
- unsigned long address,
- unsigned long flags,
+ struct vm_fault *vmf,
unsigned long reason)
{
struct mm_struct *mm = ctx->mm;
+ unsigned long address = vmf->address;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -337,7 +334,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pte_t *pte;
bool ret = true;
- mmap_assert_locked(mm);
+ assert_fault_locked(mm, vmf);
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -445,7 +442,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
* Coredumping runs without mmap_lock so we can only check that
* the mmap_lock is held, if PF_DUMPCORE was not set.
*/
- mmap_assert_locked(mm);
+ assert_fault_locked(mm, vmf);
ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
@@ -522,8 +519,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
* and wait.
*/
ret = VM_FAULT_RETRY;
- if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+ /* Per-VMA lock is expected to be dropped on VM_FAULT_RETRY */
+ BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
goto out;
+ }
/* take the reference before dropping the mmap_lock */
userfaultfd_ctx_get(ctx);
@@ -561,15 +561,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
if (!is_vm_hugetlb_page(vma))
- must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
- reason);
+ must_wait = userfaultfd_must_wait(ctx, vmf, reason);
else
- must_wait = userfaultfd_huge_must_wait(ctx, vma,
- vmf->address,
- vmf->flags, reason);
+ must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
if (is_vm_hugetlb_page(vma))
hugetlb_vma_unlock_read(vma);
- mmap_read_unlock(mm);
+ release_fault_lock(vmf);
if (likely(must_wait && !READ_ONCE(ctx->released))) {
wake_up_poll(&ctx->fd_wqh, EPOLLIN);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fec149585985..70bb2f923e33 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -705,6 +705,17 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
return true;
}
+static inline void vma_assert_locked(struct vm_area_struct *vma)
+{
+ int mm_lock_seq;
+
+ if (__is_vma_write_locked(vma, &mm_lock_seq))
+ return;
+
+ lockdep_assert_held(&vma->vm_lock->lock);
+ VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_lock->lock), vma);
+}
+
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{
int mm_lock_seq;
@@ -723,6 +734,23 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address);
+static inline
+void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
+{
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ vma_assert_locked(vmf->vma);
+ else
+ mmap_assert_locked(mm);
+}
+
+static inline void release_fault_lock(struct vm_fault *vmf)
+{
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vmf->vma);
+ else
+ mmap_read_unlock(vmf->vma->vm_mm);
+}
+
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -736,6 +764,17 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}
+static inline
+void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
+{
+ mmap_assert_locked(mm);
+}
+
+static inline void release_fault_lock(struct vm_fault *vmf)
+{
+ mmap_read_unlock(vmf->vma->vm_mm);
+}
+
#endif /* CONFIG_PER_VMA_LOCK */
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 7ee078e1a0d2..d4d8f474e0c5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1699,14 +1699,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
return ret;
}
-static void release_fault_lock(struct vm_fault *vmf)
-{
- if (vmf->flags & FAULT_FLAG_VMA_LOCK)
- vma_end_read(vmf->vma);
- else
- mmap_read_unlock(vmf->vma->vm_mm);
-}
-
/*
* Return values:
* 0 - folio is locked.
diff --git a/mm/memory.c b/mm/memory.c
index 76c7907e7286..c6c759922f39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5294,15 +5294,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;
- /*
- * Due to the possibility of userfault handler dropping mmap_lock, avoid
- * it for now and fall back to page fault handling under mmap_lock.
- */
- if (userfaultfd_armed(vma)) {
- vma_end_read(vma);
- goto inval;
- }
-
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
vma_end_read(vma);
--
2.41.0.162.gfafddb0af9-goog
* Re: [PATCH v4 6/6] mm: handle userfaults under VMA lock
2023-06-28 7:18 ` [PATCH v4 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
@ 2023-06-28 7:24 ` Suren Baghdasaryan
2023-06-28 13:57 ` Peter Xu
1 sibling, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 7:24 UTC (permalink / raw)
To: akpm
Cc: willy, hannes, mhocko, josef, jack, ldufour, laurent.dufour,
michel, liam.howlett, jglisse, vbabka, minchan, dave,
punit.agrawal, lstoakes, hdanton, apopple, peterx, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 12:18 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Enable handle_userfault to operate under VMA lock by releasing VMA lock
> instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT
> should never be used when handling faults under per-VMA lock protection
> because that would break the assumption that lock is dropped on retry.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/userfaultfd.c | 39 ++++++++++++++++++---------------------
> include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 8 --------
> mm/memory.c | 9 ---------
> 4 files changed, 57 insertions(+), 38 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 4e800bb7d2ab..d019e7df6f15 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -277,17 +277,16 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
> * hugepmd ranges.
> */
> static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - unsigned long flags,
> - unsigned long reason)
> + struct vm_fault *vmf,
> + unsigned long reason)
> {
> + struct vm_area_struct *vma = vmf->vma;
> pte_t *ptep, pte;
> bool ret = true;
>
> - mmap_assert_locked(ctx->mm);
> + assert_fault_locked(ctx->mm, vmf);
>
> - ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> + ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
> if (!ptep)
> goto out;
>
> @@ -308,10 +307,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> }
> #else
> static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - unsigned long flags,
> - unsigned long reason)
> + struct vm_fault *vmf,
> + unsigned long reason)
> {
> return false; /* should never get here */
> }
> @@ -325,11 +322,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> * threads.
> */
> static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> - unsigned long address,
> - unsigned long flags,
> + struct vm_fault *vmf,
> unsigned long reason)
> {
> struct mm_struct *mm = ctx->mm;
> + unsigned long address = vmf->address;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> @@ -337,7 +334,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> pte_t *pte;
> bool ret = true;
>
> - mmap_assert_locked(mm);
> + assert_fault_locked(mm, vmf);
>
> pgd = pgd_offset(mm, address);
> if (!pgd_present(*pgd))
> @@ -445,7 +442,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> * Coredumping runs without mmap_lock so we can only check that
> * the mmap_lock is held, if PF_DUMPCORE was not set.
> */
> - mmap_assert_locked(mm);
> + assert_fault_locked(mm, vmf);
>
> ctx = vma->vm_userfaultfd_ctx.ctx;
> if (!ctx)
> @@ -522,8 +519,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> * and wait.
> */
> ret = VM_FAULT_RETRY;
> - if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> + /* Per-VMA lock is expected to be dropped on VM_FAULT_RETRY */
> + BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
Sorry, this should have been:
+ BUG_ON(vmf->flags & FAULT_FLAG_VMA_LOCK);
> goto out;
> + }
>
> /* take the reference before dropping the mmap_lock */
> userfaultfd_ctx_get(ctx);
> @@ -561,15 +561,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
> if (!is_vm_hugetlb_page(vma))
> - must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> - reason);
> + must_wait = userfaultfd_must_wait(ctx, vmf, reason);
> else
> - must_wait = userfaultfd_huge_must_wait(ctx, vma,
> - vmf->address,
> - vmf->flags, reason);
> + must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
> if (is_vm_hugetlb_page(vma))
> hugetlb_vma_unlock_read(vma);
> - mmap_read_unlock(mm);
> + release_fault_lock(vmf);
>
> if (likely(must_wait && !READ_ONCE(ctx->released))) {
> wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fec149585985..70bb2f923e33 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -705,6 +705,17 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
> return true;
> }
>
> +static inline void vma_assert_locked(struct vm_area_struct *vma)
> +{
> + int mm_lock_seq;
> +
> + if (__is_vma_write_locked(vma, &mm_lock_seq))
> + return;
> +
> + lockdep_assert_held(&vma->vm_lock->lock);
> + VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_lock->lock), vma);
> +}
> +
> static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> {
> int mm_lock_seq;
> @@ -723,6 +734,23 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> unsigned long address);
>
> +static inline
> +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> +{
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_assert_locked(vmf->vma);
> + else
> + mmap_assert_locked(mm);
> +}
> +
> +static inline void release_fault_lock(struct vm_fault *vmf)
> +{
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vmf->vma);
> + else
> + mmap_read_unlock(vmf->vma->vm_mm);
> +}
> +
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline void vma_init_lock(struct vm_area_struct *vma) {}
> @@ -736,6 +764,17 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> static inline void vma_mark_detached(struct vm_area_struct *vma,
> bool detached) {}
>
> +static inline
> +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> +{
> + mmap_assert_locked(mm);
> +}
> +
> +static inline void release_fault_lock(struct vm_fault *vmf)
> +{
> + mmap_read_unlock(vmf->vma->vm_mm);
> +}
> +
> #endif /* CONFIG_PER_VMA_LOCK */
>
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7ee078e1a0d2..d4d8f474e0c5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1699,14 +1699,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> return ret;
> }
>
> -static void release_fault_lock(struct vm_fault *vmf)
> -{
> - if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> - vma_end_read(vmf->vma);
> - else
> - mmap_read_unlock(vmf->vma->vm_mm);
> -}
> -
> /*
> * Return values:
> * 0 - folio is locked.
> diff --git a/mm/memory.c b/mm/memory.c
> index 76c7907e7286..c6c759922f39 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5294,15 +5294,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * Due to the possibility of userfault handler dropping mmap_lock, avoid
> - * it for now and fall back to page fault handling under mmap_lock.
> - */
> - if (userfaultfd_armed(vma)) {
> - vma_end_read(vma);
> - goto inval;
> - }
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> vma_end_read(vma);
> --
> 2.41.0.162.gfafddb0af9-goog
>
* Re: [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
2023-06-28 7:17 ` [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
@ 2023-06-28 13:41 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2023-06-28 13:41 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 12:17:57AM -0700, Suren Baghdasaryan wrote:
> handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
> mmap_lock has been released. However with per-VMA locks behavior is
> different and the caller should still release it. To make the
> rules consistent for the caller, drop the per-VMA lock when returning
> VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning
> VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
> VM_FAULT_COMPLETED for now.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly
2023-06-28 7:17 ` [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
@ 2023-06-28 13:42 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2023-06-28 13:42 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 12:17:58AM -0700, Suren Baghdasaryan wrote:
> Change folio_lock_or_retry to accept vm_fault struct and return the
> vm_fault_t directly.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock
2023-06-28 7:17 ` [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
@ 2023-06-28 13:43 ` Peter Xu
2023-06-28 16:01 ` Suren Baghdasaryan
0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-06-28 13:43 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 12:17:59AM -0700, Suren Baghdasaryan wrote:
> When page fault is handled under per-VMA lock protection, all swap page
> faults are retried with mmap_lock because folio_lock_or_retry has to drop
> and reacquire mmap_lock if folio could not be immediately locked.
> Follow the same pattern as mmap_lock to drop per-VMA lock when waiting
> for folio and retrying once folio is available.
> With this obstacle removed, enable do_swap_page to operate under
> per-VMA lock protection. Drivers implementing ops->migrate_to_ram might
> still rely on mmap_lock, therefore we have to fall back to mmap_lock in
> that particular case.
> Note that the only time do_swap_page calls synchronous swap_readpage
> is when SWP_SYNCHRONOUS_IO is set, which is only set for
> QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
> pmem). Therefore we don't sleep in this path, and there's no need to
> drop the mmap or per-VMA lock.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
One nit below:
> ---
> mm/filemap.c | 25 ++++++++++++++++---------
> mm/memory.c | 16 ++++++++++------
> 2 files changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 52bcf12dcdbf..7ee078e1a0d2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1699,31 +1699,38 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> return ret;
> }
>
> +static void release_fault_lock(struct vm_fault *vmf)
> +{
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vmf->vma);
> + else
> + mmap_read_unlock(vmf->vma->vm_mm);
> +}
> +
> /*
> * Return values:
> * 0 - folio is locked.
> * VM_FAULT_RETRY - folio is not locked.
> - * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> - * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> - * which case mmap_lock is still held.
> + * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
> + * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
> + * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
> *
> * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
> - * with the folio locked and the mmap_lock unperturbed.
> + * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
> */
> vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
> {
> - struct mm_struct *mm = vmf->vma->vm_mm;
> unsigned int flags = vmf->flags;
>
> if (fault_flag_allow_retry_first(flags)) {
> /*
> - * CAUTION! In this case, mmap_lock is not released
> - * even though return VM_FAULT_RETRY.
> + * CAUTION! In this case, mmap_lock/per-VMA lock is not
> + * released even though returning VM_FAULT_RETRY.
> */
> if (flags & FAULT_FLAG_RETRY_NOWAIT)
> return VM_FAULT_RETRY;
>
> - mmap_read_unlock(mm);
> + release_fault_lock(vmf);
> if (flags & FAULT_FLAG_KILLABLE)
> folio_wait_locked_killable(folio);
> else
> @@ -1735,7 +1742,7 @@ vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
>
> ret = __folio_lock_killable(folio);
> if (ret) {
> - mmap_read_unlock(mm);
> + release_fault_lock(vmf);
> return VM_FAULT_RETRY;
> }
> } else {
> diff --git a/mm/memory.c b/mm/memory.c
> index 345080052003..76c7907e7286 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3712,12 +3712,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!pte_unmap_same(vmf))
> goto out;
>
> - if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> - ret = VM_FAULT_RETRY;
> - vma_end_read(vma);
> - goto out;
> - }
> -
> entry = pte_to_swp_entry(vmf->orig_pte);
> if (unlikely(non_swap_entry(entry))) {
> if (is_migration_entry(entry)) {
> @@ -3727,6 +3721,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->page = pfn_swap_entry_to_page(entry);
> ret = remove_device_exclusive_entry(vmf);
> } else if (is_device_private_entry(entry)) {
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> + /*
> + * migrate_to_ram is not yet ready to operate
> + * under VMA lock.
> + */
> + vma_end_read(vma);
> + ret |= VM_FAULT_RETRY;
Here IIUC ret==0 is guaranteed, so maybe "ret = VM_FAULT_RETRY" is slightly
clearer.
> + goto out;
> + }
> +
> vmf->page = pfn_swap_entry_to_page(entry);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> --
> 2.41.0.162.gfafddb0af9-goog
>
--
Peter Xu
* Re: [PATCH v4 6/6] mm: handle userfaults under VMA lock
2023-06-28 7:18 ` [PATCH v4 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
2023-06-28 7:24 ` Suren Baghdasaryan
@ 2023-06-28 13:57 ` Peter Xu
2023-06-28 16:22 ` Suren Baghdasaryan
1 sibling, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-06-28 13:57 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 12:18:00AM -0700, Suren Baghdasaryan wrote:
> Enable handle_userfault to operate under VMA lock by releasing VMA lock
> instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT
> should never be used when handling faults under per-VMA lock protection
> because that would break the assumption that lock is dropped on retry.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Besides the NOWAIT typo it all looks sane. Since it seems at least one more
version will be needed, I'll still comment on a few things.
> ---
> fs/userfaultfd.c | 39 ++++++++++++++++++---------------------
> include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 8 --------
> mm/memory.c | 9 ---------
> 4 files changed, 57 insertions(+), 38 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 4e800bb7d2ab..d019e7df6f15 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -277,17 +277,16 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
> * hugepmd ranges.
> */
> static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - unsigned long flags,
> - unsigned long reason)
> + struct vm_fault *vmf,
> + unsigned long reason)
> {
> + struct vm_area_struct *vma = vmf->vma;
> pte_t *ptep, pte;
> bool ret = true;
>
> - mmap_assert_locked(ctx->mm);
> + assert_fault_locked(ctx->mm, vmf);
AFAIU ctx->mm must be the same as vma->vm_mm here, so maybe we can also
drop *ctx here altogether if we've already dropped plenty.
>
> - ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> + ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
> if (!ptep)
> goto out;
>
> @@ -308,10 +307,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> }
> #else
> static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - unsigned long flags,
> - unsigned long reason)
> + struct vm_fault *vmf,
> + unsigned long reason)
> {
> return false; /* should never get here */
> }
> @@ -325,11 +322,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> * threads.
> */
> static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> - unsigned long address,
> - unsigned long flags,
> + struct vm_fault *vmf,
> unsigned long reason)
> {
> struct mm_struct *mm = ctx->mm;
> + unsigned long address = vmf->address;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> @@ -337,7 +334,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> pte_t *pte;
> bool ret = true;
>
> - mmap_assert_locked(mm);
> + assert_fault_locked(mm, vmf);
>
> pgd = pgd_offset(mm, address);
> if (!pgd_present(*pgd))
> @@ -445,7 +442,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> * Coredumping runs without mmap_lock so we can only check that
> * the mmap_lock is held, if PF_DUMPCORE was not set.
> */
> - mmap_assert_locked(mm);
> + assert_fault_locked(mm, vmf);
>
> ctx = vma->vm_userfaultfd_ctx.ctx;
> if (!ctx)
> @@ -522,8 +519,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> * and wait.
> */
> ret = VM_FAULT_RETRY;
> - if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> + /* Per-VMA lock is expected to be dropped on VM_FAULT_RETRY */
> + BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
This is not the only place where we can have FAULT_FLAG_RETRY_NOWAIT.
E.g. folio_lock_or_retry() can also see it, so a check here may or may not
help much.
The other thing: please consider not using BUG_ON() if possible.
WARN_ON_ONCE() is IMHO always preferable when the kernel can still try to
run even if it triggers.
I'd rather drop this change, leaving space for the future when the vma lock
may be supported in gup paths with NOWAIT; then it'll work naturally here,
afaiu.
If we really want a sanity check, maybe the best place is when entering
handle_mm_fault(), to be explicit, in sanitize_fault_flags().
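Something like this, as a rough sketch only (assuming sanitize_fault_flags()
keeps its current shape):

static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
				       unsigned int *flags)
{
	/* ... existing checks ... */

	/*
	 * The per-VMA lock is always dropped before returning
	 * VM_FAULT_RETRY, so NOWAIT (which expects the lock to be
	 * kept) cannot be combined with it.
	 */
	if (WARN_ON_ONCE((*flags & FAULT_FLAG_VMA_LOCK) &&
			 (*flags & FAULT_FLAG_RETRY_NOWAIT)))
		return VM_FAULT_SIGSEGV;

	return 0;
}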
> goto out;
> + }
>
> /* take the reference before dropping the mmap_lock */
> userfaultfd_ctx_get(ctx);
> @@ -561,15 +561,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
> if (!is_vm_hugetlb_page(vma))
> - must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> - reason);
> + must_wait = userfaultfd_must_wait(ctx, vmf, reason);
> else
> - must_wait = userfaultfd_huge_must_wait(ctx, vma,
> - vmf->address,
> - vmf->flags, reason);
> + must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
> if (is_vm_hugetlb_page(vma))
> hugetlb_vma_unlock_read(vma);
> - mmap_read_unlock(mm);
> + release_fault_lock(vmf);
>
> if (likely(must_wait && !READ_ONCE(ctx->released))) {
> wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fec149585985..70bb2f923e33 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -705,6 +705,17 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
> return true;
> }
>
> +static inline void vma_assert_locked(struct vm_area_struct *vma)
> +{
> + int mm_lock_seq;
> +
> + if (__is_vma_write_locked(vma, &mm_lock_seq))
> + return;
> +
> + lockdep_assert_held(&vma->vm_lock->lock);
> + VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_lock->lock), vma);
> +}
> +
> static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> {
> int mm_lock_seq;
> @@ -723,6 +734,23 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> unsigned long address);
>
> +static inline
> +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> +{
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_assert_locked(vmf->vma);
> + else
> + mmap_assert_locked(mm);
> +}
> +
> +static inline void release_fault_lock(struct vm_fault *vmf)
> +{
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vmf->vma);
> + else
> + mmap_read_unlock(vmf->vma->vm_mm);
> +}
> +
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline void vma_init_lock(struct vm_area_struct *vma) {}
> @@ -736,6 +764,17 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> static inline void vma_mark_detached(struct vm_area_struct *vma,
> bool detached) {}
>
> +static inline
> +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> +{
> + mmap_assert_locked(mm);
> +}
> +
> +static inline void release_fault_lock(struct vm_fault *vmf)
> +{
> + mmap_read_unlock(vmf->vma->vm_mm);
> +}
> +
> #endif /* CONFIG_PER_VMA_LOCK */
>
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7ee078e1a0d2..d4d8f474e0c5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1699,14 +1699,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> return ret;
> }
>
> -static void release_fault_lock(struct vm_fault *vmf)
> -{
> - if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> - vma_end_read(vmf->vma);
> - else
> - mmap_read_unlock(vmf->vma->vm_mm);
> -}
The movement is fine but may not be the cleanest. It would be nicer to me
if it were put in the right place when first introduced, since it's all
within the same series anyway.
Thanks,
> -
> /*
> * Return values:
> * 0 - folio is locked.
> diff --git a/mm/memory.c b/mm/memory.c
> index 76c7907e7286..c6c759922f39 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5294,15 +5294,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * Due to the possibility of userfault handler dropping mmap_lock, avoid
> - * it for now and fall back to page fault handling under mmap_lock.
> - */
> - if (userfaultfd_armed(vma)) {
> - vma_end_read(vma);
> - goto inval;
> - }
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> vma_end_read(vma);
> --
> 2.41.0.162.gfafddb0af9-goog
>
--
Peter Xu
* Re: [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock
2023-06-28 13:43 ` Peter Xu
@ 2023-06-28 16:01 ` Suren Baghdasaryan
0 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 16:01 UTC (permalink / raw)
To: Peter Xu
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 6:43 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 12:17:59AM -0700, Suren Baghdasaryan wrote:
> > When a page fault is handled under per-VMA lock protection, all swap page
> > faults are retried with mmap_lock because folio_lock_or_retry has to drop
> > and reacquire mmap_lock if the folio could not be immediately locked.
> > Follow the same pattern as with mmap_lock and drop the per-VMA lock when
> > waiting for the folio, retrying once the folio is available.
> > With this obstacle removed, enable do_swap_page to operate under
> > per-VMA lock protection. Drivers implementing ops->migrate_to_ram might
> > still rely on mmap_lock; therefore we have to fall back to mmap_lock in
> > that particular case.
> > Note that the only time do_swap_page calls synchronous swap_readpage
> > is when SWP_SYNCHRONOUS_IO is set, which is only set for
> > QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
> > pmem). Therefore we don't sleep in this path, and there's no need to
> > drop the mmap or per-VMA lock.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> One nit below:
>
> > ---
> > mm/filemap.c | 25 ++++++++++++++++---------
> > mm/memory.c | 16 ++++++++++------
> > 2 files changed, 26 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 52bcf12dcdbf..7ee078e1a0d2 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1699,31 +1699,38 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> > return ret;
> > }
> >
> > +static void release_fault_lock(struct vm_fault *vmf)
> > +{
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vmf->vma);
> > + else
> > + mmap_read_unlock(vmf->vma->vm_mm);
> > +}
> > +
> > /*
> > * Return values:
> > * 0 - folio is locked.
> > * VM_FAULT_RETRY - folio is not locked.
> > - * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> > - * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> > - * which case mmap_lock is still held.
> > + * mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
> > + * vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
> > + * FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
> > *
> > * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
> > - * with the folio locked and the mmap_lock unperturbed.
> > + * with the folio locked and the mmap_lock/per-VMA lock is left unperturbed.
> > */
> > vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
> > {
> > - struct mm_struct *mm = vmf->vma->vm_mm;
> > unsigned int flags = vmf->flags;
> >
> > if (fault_flag_allow_retry_first(flags)) {
> > /*
> > - * CAUTION! In this case, mmap_lock is not released
> > - * even though return VM_FAULT_RETRY.
> > + * CAUTION! In this case, mmap_lock/per-VMA lock is not
> > + * released even though returning VM_FAULT_RETRY.
> > */
> > if (flags & FAULT_FLAG_RETRY_NOWAIT)
> > return VM_FAULT_RETRY;
> >
> > - mmap_read_unlock(mm);
> > + release_fault_lock(vmf);
> > if (flags & FAULT_FLAG_KILLABLE)
> > folio_wait_locked_killable(folio);
> > else
> > @@ -1735,7 +1742,7 @@ vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
> >
> > ret = __folio_lock_killable(folio);
> > if (ret) {
> > - mmap_read_unlock(mm);
> > + release_fault_lock(vmf);
> > return VM_FAULT_RETRY;
> > }
> > } else {
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 345080052003..76c7907e7286 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3712,12 +3712,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (!pte_unmap_same(vmf))
> > goto out;
> >
> > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > - ret = VM_FAULT_RETRY;
> > - vma_end_read(vma);
> > - goto out;
> > - }
> > -
> > entry = pte_to_swp_entry(vmf->orig_pte);
> > if (unlikely(non_swap_entry(entry))) {
> > if (is_migration_entry(entry)) {
> > @@ -3727,6 +3721,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->page = pfn_swap_entry_to_page(entry);
> > ret = remove_device_exclusive_entry(vmf);
> > } else if (is_device_private_entry(entry)) {
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > + /*
> > + * migrate_to_ram is not yet ready to operate
> > + * under VMA lock.
> > + */
> > + vma_end_read(vma);
> > + ret |= VM_FAULT_RETRY;
>
> Here IIUC ret==0 is guaranteed, so maybe "ret = VM_FAULT_RETRY" is slightly
> clearer.
Ack.
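
For reference, a minimal sketch of how the branch above might read with that
nit applied (illustrative only - the actual respin may differ):

		} else if (is_device_private_entry(entry)) {
			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
				/*
				 * migrate_to_ram is not yet ready to operate
				 * under VMA lock.
				 */
				vma_end_read(vma);
				/* ret is known to be 0 at this point */
				ret = VM_FAULT_RETRY;
				goto out;
			}
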
>
> > + goto out;
> > + }
> > +
> > vmf->page = pfn_swap_entry_to_page(entry);
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > vmf->address, &vmf->ptl);
> > --
> > 2.41.0.162.gfafddb0af9-goog
> >
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 6/6] mm: handle userfaults under VMA lock
2023-06-28 13:57 ` Peter Xu
@ 2023-06-28 16:22 ` Suren Baghdasaryan
2023-06-28 16:47 ` Peter Xu
0 siblings, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-28 16:22 UTC (permalink / raw)
To: Peter Xu
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 6:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 28, 2023 at 12:18:00AM -0700, Suren Baghdasaryan wrote:
> > Enable handle_userfault to operate under the VMA lock by releasing the
> > VMA lock instead of mmap_lock and retrying. Note that
> > FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under
> > per-VMA lock protection because that would break the assumption that the
> > lock is dropped on retry.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Besides the NOWAIT typo it all looks sane. Since there seems to be a need
> for at least one more version, I'll still comment on a few things...
>
> > ---
> > fs/userfaultfd.c | 39 ++++++++++++++++++---------------------
> > include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
> > mm/filemap.c | 8 --------
> > mm/memory.c | 9 ---------
> > 4 files changed, 57 insertions(+), 38 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 4e800bb7d2ab..d019e7df6f15 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -277,17 +277,16 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
> > * hugepmd ranges.
> > */
> > static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > - struct vm_area_struct *vma,
> > - unsigned long address,
> > - unsigned long flags,
> > - unsigned long reason)
> > + struct vm_fault *vmf,
> > + unsigned long reason)
> > {
> > + struct vm_area_struct *vma = vmf->vma;
> > pte_t *ptep, pte;
> > bool ret = true;
> >
> > - mmap_assert_locked(ctx->mm);
> > + assert_fault_locked(ctx->mm, vmf);
>
> AFAIU ctx->mm must be the same as vma->vm_mm here, so maybe we can also
> drop *ctx here altogether if we've already dropped plenty.
Ack. I was not sure if the ctx->mm would always be the same as vmf->mm.
>
> >
> > - ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> > + ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
> > if (!ptep)
> > goto out;
> >
> > @@ -308,10 +307,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > }
> > #else
> > static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > - struct vm_area_struct *vma,
> > - unsigned long address,
> > - unsigned long flags,
> > - unsigned long reason)
> > + struct vm_fault *vmf,
> > + unsigned long reason)
> > {
> > return false; /* should never get here */
> > }
> > @@ -325,11 +322,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > * threads.
> > */
> > static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> > - unsigned long address,
> > - unsigned long flags,
> > + struct vm_fault *vmf,
> > unsigned long reason)
> > {
> > struct mm_struct *mm = ctx->mm;
> > + unsigned long address = vmf->address;
> > pgd_t *pgd;
> > p4d_t *p4d;
> > pud_t *pud;
> > @@ -337,7 +334,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> > pte_t *pte;
> > bool ret = true;
> >
> > - mmap_assert_locked(mm);
> > + assert_fault_locked(mm, vmf);
> >
> > pgd = pgd_offset(mm, address);
> > if (!pgd_present(*pgd))
> > @@ -445,7 +442,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > * Coredumping runs without mmap_lock so we can only check that
> > * the mmap_lock is held, if PF_DUMPCORE was not set.
> > */
> > - mmap_assert_locked(mm);
> > + assert_fault_locked(mm, vmf);
> >
> > ctx = vma->vm_userfaultfd_ctx.ctx;
> > if (!ctx)
> > @@ -522,8 +519,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > * and wait.
> > */
> > ret = VM_FAULT_RETRY;
> > - if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> > + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> > + /* Per-VMA lock is expected to be dropped on VM_FAULT_RETRY */
> > + BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
>
> This is not the only place where we can have FAULT_FLAG_RETRY_NOWAIT.
> E.g. folio_lock_or_retry() can also get it, so a check here may or may not
> help much.
>
> The other thing is, please consider not using BUG_ON() if possible.
> WARN_ON_ONCE() is IMHO always preferable when the kernel can still try to
> keep running even if it triggers.
Ack.
>
> I'd rather drop this change, leaving room for the future when the VMA lock
> may be supported in gup paths with NOWAIT; then it'll work naturally here,
> AFAIU. If we really want a sanity check, maybe the best place is when
> entering handle_mm_fault() - to be explicit, in sanitize_fault_flags().
That sounds like a good idea. Thanks!
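
For illustration, a minimal sketch of what such a check could look like in
sanitize_fault_flags() (assuming its current signature taking the flags by
pointer; this is a sketch, not part of the posted series):

	static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
					       unsigned int *flags)
	{
		...
		if (*flags & FAULT_FLAG_VMA_LOCK) {
			/*
			 * Per-VMA locks are always dropped on VM_FAULT_RETRY,
			 * so FAULT_FLAG_RETRY_NOWAIT cannot be honoured here.
			 */
			if (WARN_ON_ONCE(*flags & FAULT_FLAG_RETRY_NOWAIT))
				return VM_FAULT_SIGSEGV;
		}
		...
		return 0;
	}
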
>
> > goto out;
> > + }
> >
> > /* take the reference before dropping the mmap_lock */
> > userfaultfd_ctx_get(ctx);
> > @@ -561,15 +561,12 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> >
> > if (!is_vm_hugetlb_page(vma))
> > - must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> > - reason);
> > + must_wait = userfaultfd_must_wait(ctx, vmf, reason);
> > else
> > - must_wait = userfaultfd_huge_must_wait(ctx, vma,
> > - vmf->address,
> > - vmf->flags, reason);
> > + must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
> > if (is_vm_hugetlb_page(vma))
> > hugetlb_vma_unlock_read(vma);
> > - mmap_read_unlock(mm);
> > + release_fault_lock(vmf);
> >
> > if (likely(must_wait && !READ_ONCE(ctx->released))) {
> > wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index fec149585985..70bb2f923e33 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -705,6 +705,17 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
> > return true;
> > }
> >
> > +static inline void vma_assert_locked(struct vm_area_struct *vma)
> > +{
> > + int mm_lock_seq;
> > +
> > + if (__is_vma_write_locked(vma, &mm_lock_seq))
> > + return;
> > +
> > + lockdep_assert_held(&vma->vm_lock->lock);
> > + VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_lock->lock), vma);
> > +}
> > +
> > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > {
> > int mm_lock_seq;
> > @@ -723,6 +734,23 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> > struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > unsigned long address);
> >
> > +static inline
> > +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> > +{
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + vma_assert_locked(vmf->vma);
> > + else
> > + mmap_assert_locked(mm);
> > +}
> > +
> > +static inline void release_fault_lock(struct vm_fault *vmf)
> > +{
> > + if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vmf->vma);
> > + else
> > + mmap_read_unlock(vmf->vma->vm_mm);
> > +}
> > +
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > @@ -736,6 +764,17 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> > static inline void vma_mark_detached(struct vm_area_struct *vma,
> > bool detached) {}
> >
> > +static inline
> > +void assert_fault_locked(struct mm_struct *mm, struct vm_fault *vmf)
> > +{
> > + mmap_assert_locked(mm);
> > +}
> > +
> > +static inline void release_fault_lock(struct vm_fault *vmf)
> > +{
> > + mmap_read_unlock(vmf->vma->vm_mm);
> > +}
> > +
> > #endif /* CONFIG_PER_VMA_LOCK */
> >
> > /*
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7ee078e1a0d2..d4d8f474e0c5 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1699,14 +1699,6 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> > return ret;
> > }
> >
> > -static void release_fault_lock(struct vm_fault *vmf)
> > -{
> > - if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > - vma_end_read(vmf->vma);
> > - else
> > - mmap_read_unlock(vmf->vma->vm_mm);
> > -}
>
> The movement is fine but may not be the cleanest. It'll be nicer to me if
> it's put in the right place when it's introduced - after all, it's in the
> same series.
Ack.
Thanks for the review!
>
> Thanks,
>
> > -
> > /*
> > * Return values:
> > * 0 - folio is locked.
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 76c7907e7286..c6c759922f39 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5294,15 +5294,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > if (!vma_start_read(vma))
> > goto inval;
> >
> > - /*
> > - * Due to the possibility of userfault handler dropping mmap_lock, avoid
> > - * it for now and fall back to page fault handling under mmap_lock.
> > - */
> > - if (userfaultfd_armed(vma)) {
> > - vma_end_read(vma);
> > - goto inval;
> > - }
> > -
> > /* Check since vm_start/vm_end might change before we lock the VMA */
> > if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > vma_end_read(vma);
> > --
> > 2.41.0.162.gfafddb0af9-goog
> >
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 6/6] mm: handle userfaults under VMA lock
2023-06-28 16:22 ` Suren Baghdasaryan
@ 2023-06-28 16:47 ` Peter Xu
0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2023-06-28 16:47 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, willy, hannes, mhocko, josef, jack, ldufour,
laurent.dufour, michel, liam.howlett, jglisse, vbabka, minchan,
dave, punit.agrawal, lstoakes, hdanton, apopple, ying.huang,
david, yuzhao, dhowells, hughd, viro, brauner, pasha.tatashin,
linux-mm, linux-fsdevel, linux-kernel, kernel-team
On Wed, Jun 28, 2023 at 09:22:24AM -0700, Suren Baghdasaryan wrote:
> Ack. I was not sure if the ctx->mm would always be the same as vmf->mm.
Feel free to look at the start of handle_userfault(), where there's:
	struct vm_area_struct *vma = vmf->vma;
	struct mm_struct *mm = vma->vm_mm;
	struct userfaultfd_ctx *ctx;
	...
	ctx = vma->vm_userfaultfd_ctx.ctx;
	...
	BUG_ON(ctx->mm != mm);
	...
So I think we should be safe. Thanks,
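
Building on that invariant, a minimal sketch of userfaultfd_huge_must_wait()
with the ctx parameter dropped could look like this (illustrative only, not
part of the posted series):

	static inline bool userfaultfd_huge_must_wait(struct vm_fault *vmf,
						      unsigned long reason)
	{
		struct vm_area_struct *vma = vmf->vma;
		pte_t *ptep, pte;
		bool ret = true;

		/* ctx->mm is guaranteed to equal vma->vm_mm, see above */
		assert_fault_locked(vma->vm_mm, vmf);

		ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
		if (!ptep)
			goto out;
		...
	out:
		return ret;
	}
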
--
Peter Xu
^ permalink raw reply [flat|nested] 15+ messages in thread
Thread overview: 15+ messages
2023-06-28 7:17 [PATCH v4 0/6] Per-VMA lock support for swap and userfaults Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 1/6] swap: remove remnants of polling from read_swap_cache_async Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 2/6] mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED Suren Baghdasaryan
2023-06-28 7:17 ` [PATCH v4 3/6] mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED Suren Baghdasaryan
2023-06-28 13:41 ` Peter Xu
2023-06-28 7:17 ` [PATCH v4 4/6] mm: change folio_lock_or_retry to use vm_fault directly Suren Baghdasaryan
2023-06-28 13:42 ` Peter Xu
2023-06-28 7:17 ` [PATCH v4 5/6] mm: handle swap page faults under per-VMA lock Suren Baghdasaryan
2023-06-28 13:43 ` Peter Xu
2023-06-28 16:01 ` Suren Baghdasaryan
2023-06-28 7:18 ` [PATCH v4 6/6] mm: handle userfaults under VMA lock Suren Baghdasaryan
2023-06-28 7:24 ` Suren Baghdasaryan
2023-06-28 13:57 ` Peter Xu
2023-06-28 16:22 ` Suren Baghdasaryan
2023-06-28 16:47 ` Peter Xu