* [PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook
@ 2025-05-09 12:13 Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-09 12:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
During the mmap() of a file-backed mapping, we invoke the underlying driver
file's mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.
This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.
However, a fly in the ointment remains - drivers have a great deal of
freedom in the .mmap() hook to manipulate VMA state (as well as page table
state).
This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).
In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and may do so and
then raise an error, requiring very careful unwinding of state about which
we can make no assumptions.
Rather than providing such an open-ended interface, this series provides an
alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:
struct vm_area_desc {
        /* Immutable state. */
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;

        /* Mutable fields. Populated with initial state. */
        pgoff_t pgoff;
        struct file *file;
        vm_flags_t vm_flags;
        pgprot_t page_prot;

        /* Write-only fields. */
        const struct vm_operations_struct *vm_ops;
        void *private_data;
};
The mmap() logic then uses this updated state either to merge with an
existing VMA or to establish a new VMA.
This is achieved via the new file hook .mmap_prepare(), which is,
importantly, invoked very early on in the mmap() process.
If an error arises, we can very simply abort the operation with very little
unwinding of state required.
The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.
Right now the logic will retry a merge like this only if the driver changes
VMA flags, and changes them in such a way that a merge might succeed (that
is, the flags are not 'special' - they do not contain any of the flags
specified in VM_SPECIAL).
This has also been the source of a great deal of pain - it's hard to
reason about an .mmap() callback that might do - anything - but it's also
hard to reason about setting up a VMA and writing to the maple tree, only
to do it again utilising a great deal of shared state.
Since .mmap_prepare() sets fields before the first merge is even attempted,
the use of this callback obviates the need for this retry merge logic.
A driver may specify only one of the .mmap_prepare() or deprecated .mmap()
callbacks. In future we may add further callbacks beyond .mmap_prepare() to
facilitate all use cases as we convert drivers.
In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that (a) the
VMA flags changed and (b) the resulting VMA would be mergeable.
In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable
VM_SPECIAL flags.
Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:
* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()
For the removal of this retry merge to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.
This is a very unlikely use case and so it does not appear to be worthwhile
to adjust this functionality accordingly.
We can, however, very quickly do so if needed by simply adding an
.mmap_prepare() callback to these as required.
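For instance, such a conversion for ext4 might look something like the
below (a sketch only, loosely modelled on the existing ext4_file_mmap(),
with the DAX device support checks elided):

        static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
        {
                file_accessed(desc->file);
                if (IS_DAX(file_inode(desc->file))) {
                        desc->vm_ops = &ext4_dax_vm_ops;
                        desc->vm_flags |= VM_HUGEPAGE;
                } else {
                        desc->vm_ops = &ext4_file_vm_ops;
                }
                return 0;
        }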
There are two further non-DAX cases I identified:
* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.
Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.
Finally, we are left with a viable case:
* secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP.
This is viable enough that the mm selftests trigger the logic as a matter
of course. Therefore, this series replaces the secretmem_mmap() hook with
secretmem_mmap_prepare().
v2:
* Made len variable in secretmem_mmap_prepare() constant as per David.
* Dropped file_has_deprecated_mmap_hook() and file_has_mmap_prepare()
helpers as unnecessary as per David.
* Added WARN_ON_ONCE() to conditions that should not happen in practice as
per David.
* Updated vma_internal.h accordingly, added todo to fixup duplication here
in a future series, as discussed with David.
* Dropped unnecessary 'at least one must be specified' comment as per
David.
* Propagated tags (thanks David!)
* Removed useless file_has_valid_mmap_hooks() check from
call_mmap_prepare() - if the .mmap_prepare() hook is not set correctly,
this will blow up anyway.
v1:
* Seems generally supported, so un-RFC :)
* Propagated tag, thanks Mike!
https://lore.kernel.org/all/cover.1746615512.git.lorenzo.stoakes@oracle.com/
RFC v2:
* Renamed .mmap_proto() to .mmap_prepare() as per David.
* Made .mmap_prepare(), .mmap() mutually exclusive.
* Updated call_mmap() to bail out if .mmap_prepare() callback present as per Jann.
* Renamed vma_proto to vm_area_desc as per Mike.
* Added accidentally missing page_prot assignment/read from vm_area_desc.
* Renamed vm_area_desc->flags to vm_flags as per Liam.
* Added [__]call_mmap_prepare() for consistency with call_mmap().
* Added file_has_xxx_hook() helpers.
* Renamed file_has_valid_mmap() to file_has_valid_mmap_hooks() and check that
the hook is mutually exclusive.
https://lore.kernel.org/all/cover.1746116777.git.lorenzo.stoakes@oracle.com/
RFC v1:
https://lore.kernel.org/all/cover.1746040540.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (3):
mm: introduce new .mmap_prepare() file callback
mm: secretmem: convert to .mmap_prepare() hook
mm/vma: remove mmap() retry merge
include/linux/fs.h | 25 ++++++++++
include/linux/mm_types.h | 24 ++++++++++
mm/memory.c | 3 +-
mm/mmap.c | 2 +-
mm/secretmem.c | 14 +++---
mm/vma.c | 80 ++++++++++++++++++++++++++------
tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++--
7 files changed, 186 insertions(+), 28 deletions(-)
--
2.49.0
* [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 [PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes
@ 2025-05-09 12:13 ` Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
` (4 more replies)
2025-05-09 12:13 ` [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
2 siblings, 5 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-09 12:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
Provide a means by which drivers can specify which fields of those
permitted to be changed should be altered prior to mmap()'ing a range
(which may either result from a merge or from mapping an entirely new
VMA).
Doing so is substantially safer than the existing .mmap() callback which
provides unrestricted access to the part-constructed VMA and permits
drivers and file systems to do 'creative' things which make it hard to
reason about the state of the VMA after the function returns.
The existing .mmap() callback's freedom has caused a great deal of issues,
especially in error handling, as unwinding the mmap() state has proven to
be non-trivial and caused significant issues in the past, for instance
those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour").
It also necessitates a second attempt at merge once the .mmap() callback
has completed, which has caused issues in the past, is awkward, adds
overhead and is difficult to reason about.
The .mmap_prepare() callback eliminates this requirement, as we can update
fields prior to even attempting the first merge. It is safer, as we heavily
restrict what can actually be modified, and being invoked very early in the
mmap() process, error handling can be performed safely with very little
unwinding of state required.
The .mmap_prepare() and deprecated .mmap() callbacks are mutually
exclusive, so we permit only one to be invoked at a time.
Update vma userland test stubs to account for changes.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/fs.h | 25 ++++++++++++
include/linux/mm_types.h | 24 +++++++++++
mm/memory.c | 3 +-
mm/mmap.c | 2 +-
mm/vma.c | 68 +++++++++++++++++++++++++++++++-
tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
6 files changed, 180 insertions(+), 8 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..e2721a1ff13d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2169,6 +2169,7 @@ struct file_operations {
int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
+ int (*mmap_prepare)(struct vm_area_desc *);
} __randomize_layout;
/* Supports async buffered reads */
@@ -2238,11 +2239,35 @@ struct inode_operations {
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
} ____cacheline_aligned;
+/* Did the driver provide valid mmap hook configuration? */
+static inline bool file_has_valid_mmap_hooks(struct file *file)
+{
+ bool has_mmap = file->f_op->mmap;
+ bool has_mmap_prepare = file->f_op->mmap_prepare;
+
+ /* Hooks are mutually exclusive. */
+ if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
+ return false;
+ if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
+ return false;
+
+ return true;
+}
+
static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
{
+ if (WARN_ON_ONCE(file->f_op->mmap_prepare))
+ return -EINVAL;
+
return file->f_op->mmap(file, vma);
}
+static inline int __call_mmap_prepare(struct file *file,
+ struct vm_area_desc *desc)
+{
+ return file->f_op->mmap_prepare(desc);
+}
+
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..15808cad2bc1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -763,6 +763,30 @@ struct vma_numab_state {
int prev_scan_seq;
};
+/*
+ * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
+ * manipulate mutable fields which will cause those fields to be updated in the
+ * resultant VMA.
+ *
+ * Helper functions are not required for manipulating any field.
+ */
+struct vm_area_desc {
+ /* Immutable state. */
+ struct mm_struct *mm;
+ unsigned long start;
+ unsigned long end;
+
+ /* Mutable fields. Populated with initial state. */
+ pgoff_t pgoff;
+ struct file *file;
+ vm_flags_t vm_flags;
+ pgprot_t page_prot;
+
+ /* Write-only fields. */
+ const struct vm_operations_struct *vm_ops;
+ void *private_data;
+};
+
/*
* This struct describes a virtual memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
diff --git a/mm/memory.c b/mm/memory.c
index 68c1d962d0ad..99af83434e7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
dump_page(page, "bad pte");
pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
- pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
+ pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
vma->vm_file,
vma->vm_ops ? vma->vm_ops->fault : NULL,
vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
+ vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
mapping ? mapping->a_ops->read_folio : NULL);
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
diff --git a/mm/mmap.c b/mm/mmap.c
index 81dd962a1cfc..50f902c08341 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags &= ~VM_MAYEXEC;
}
- if (!file->f_op->mmap)
+ if (!file_has_valid_mmap_hooks(file))
return -ENODEV;
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
diff --git a/mm/vma.c b/mm/vma.c
index 1f2634b29568..3f32e04bb6cc 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -17,6 +17,11 @@ struct mmap_state {
unsigned long pglen;
unsigned long flags;
struct file *file;
+ pgprot_t page_prot;
+
+ /* User-defined fields, perhaps updated by .mmap_prepare(). */
+ const struct vm_operations_struct *vm_ops;
+ void *vm_private_data;
unsigned long charged;
bool retry_merge;
@@ -40,6 +45,7 @@ struct mmap_state {
.pglen = PHYS_PFN(len_), \
.flags = flags_, \
.file = file_, \
+ .page_prot = vm_get_page_prot(flags_), \
}
#define VMG_MMAP_STATE(name, map_, vma_) \
@@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
int error;
vma->vm_file = get_file(map->file);
+
+ if (!map->file->f_op->mmap)
+ return 0;
+
error = mmap_file(vma->vm_file, vma);
if (error) {
fput(vma->vm_file);
@@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
vma_iter_config(vmi, map->addr, map->end);
vma_set_range(vma, map->addr, map->end, map->pgoff);
vm_flags_init(vma, map->flags);
- vma->vm_page_prot = vm_get_page_prot(map->flags);
+ vma->vm_page_prot = map->page_prot;
if (vma_iter_prealloc(vmi, vma)) {
error = -ENOMEM;
@@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
vma_set_page_prot(vma);
}
+/*
+ * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
+ * specifies it.
+ *
+ * This is called prior to any merge attempt, and updates whitelisted fields
+ * that are permitted to be updated by the caller.
+ *
+ * All but user-defined fields will be pre-populated with original values.
+ *
+ * Returns 0 on success, or an error code otherwise.
+ */
+static int call_mmap_prepare(struct mmap_state *map)
+{
+ int err;
+ struct vm_area_desc desc = {
+ .mm = map->mm,
+ .start = map->addr,
+ .end = map->end,
+
+ .pgoff = map->pgoff,
+ .file = map->file,
+ .vm_flags = map->flags,
+ .page_prot = map->page_prot,
+ };
+
+ /* Invoke the hook. */
+ err = __call_mmap_prepare(map->file, &desc);
+ if (err)
+ return err;
+
+ /* Update fields permitted to be changed. */
+ map->pgoff = desc.pgoff;
+ map->file = desc.file;
+ map->flags = desc.vm_flags;
+ map->page_prot = desc.page_prot;
+ /* User-defined fields. */
+ map->vm_ops = desc.vm_ops;
+ map->vm_private_data = desc.private_data;
+
+ return 0;
+}
+
+static void set_vma_user_defined_fields(struct vm_area_struct *vma,
+ struct mmap_state *map)
+{
+ if (map->vm_ops)
+ vma->vm_ops = map->vm_ops;
+ vma->vm_private_data = map->vm_private_data;
+}
+
static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
@@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
int error;
+ bool have_mmap_prepare = file && file->f_op->mmap_prepare;
VMA_ITERATOR(vmi, mm, addr);
MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
error = __mmap_prepare(&map, uf);
+ if (!error && have_mmap_prepare)
+ error = call_mmap_prepare(&map);
if (error)
goto abort_munmap;
@@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
goto unacct_error;
}
+ if (have_mmap_prepare)
+ set_vma_user_defined_fields(vma, &map);
+
/* If flags changed, we might be able to merge, so try again. */
if (map.retry_merge) {
struct vm_area_struct *merged;
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 198abe66de5a..f6e45e62da3a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -253,8 +253,40 @@ struct mm_struct {
unsigned long flags; /* Must use atomic bitops to access */
};
+struct vm_area_struct;
+
+/*
+ * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
+ * manipulate mutable fields which will cause those fields to be updated in the
+ * resultant VMA.
+ *
+ * Helper functions are not required for manipulating any field.
+ */
+struct vm_area_desc {
+ /* Immutable state. */
+ struct mm_struct *mm;
+ unsigned long start;
+ unsigned long end;
+
+ /* Mutable fields. Populated with initial state. */
+ pgoff_t pgoff;
+ struct file *file;
+ vm_flags_t vm_flags;
+ pgprot_t page_prot;
+
+ /* Write-only fields. */
+ const struct vm_operations_struct *vm_ops;
+ void *private_data;
+};
+
+struct file_operations {
+ int (*mmap)(struct file *, struct vm_area_struct *);
+ int (*mmap_prepare)(struct vm_area_desc *);
+};
+
struct file {
struct address_space *f_mapping;
+ const struct file_operations *f_op;
};
#define VMA_LOCK_OFFSET 0x40000000
@@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
vma->__vm_flags &= ~flags;
}
-static inline int call_mmap(struct file *, struct vm_area_struct *)
-{
- return 0;
-}
-
static inline int shmem_zero_setup(struct vm_area_struct *)
{
return 0;
@@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
(void)vma;
}
+/* Did the driver provide valid mmap hook configuration? */
+static inline bool file_has_valid_mmap_hooks(struct file *file)
+{
+ bool has_mmap = file->f_op->mmap;
+ bool has_mmap_prepare = file->f_op->mmap_prepare;
+
+ /* Hooks are mutually exclusive. */
+ if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
+ return false;
+ if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
+ return false;
+
+ return true;
+}
+
+static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (WARN_ON_ONCE(file->f_op->mmap_prepare))
+ return -EINVAL;
+
+ return file->f_op->mmap(file, vma);
+}
+
+static inline int __call_mmap_prepare(struct file *file,
+ struct vm_area_desc *desc)
+{
+ return file->f_op->mmap_prepare(desc);
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.49.0
* [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook
2025-05-09 12:13 [PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
@ 2025-05-09 12:13 ` Lorenzo Stoakes
2025-05-12 13:41 ` Vlastimil Babka
2025-05-13 13:23 ` Liam R. Howlett
2025-05-09 12:13 ` [PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
2 siblings, 2 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-09 12:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
Secretmem has a simple .mmap() hook which is easily converted to the new
.mmap_prepare() callback.
Importantly, it's a rare instance of a driver that manipulates a VMA which
is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA
flags which may affect mergeability, meaning the retry merge logic might
impact whether or not the VMA is merged.
By using .mmap_prepare() there's no longer any need to retry the merge
later as we can simply set the correct flags from the start.
This change therefore allows us to remove the retry merge logic in a
subsequent commit.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
mm/secretmem.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1b0a214ee558..589b26c2d553 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file)
return 0;
}
-static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int secretmem_mmap_prepare(struct vm_area_desc *desc)
{
- unsigned long len = vma->vm_end - vma->vm_start;
+ const unsigned long len = desc->end - desc->start;
- if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+ if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
return -EINVAL;
- if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+ if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len))
return -EAGAIN;
- vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
- vma->vm_ops = &secretmem_vm_ops;
+ desc->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+ desc->vm_ops = &secretmem_vm_ops;
return 0;
}
@@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
static const struct file_operations secretmem_fops = {
.release = secretmem_release,
- .mmap = secretmem_mmap,
+ .mmap_prepare = secretmem_mmap_prepare,
};
static int secretmem_migrate_folio(struct address_space *mapping,
--
2.49.0
* [PATCH v2 3/3] mm/vma: remove mmap() retry merge
2025-05-09 12:13 [PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-09 12:13 ` [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes
@ 2025-05-09 12:13 ` Lorenzo Stoakes
2025-05-12 13:44 ` Vlastimil Babka
2025-05-13 13:25 ` Liam R. Howlett
2 siblings, 2 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-09 12:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
We have now introduced a mechanism that obviates the need for a reattempted
merge via the mmap_prepare() file hook, so eliminate this functionality
altogether.
The retry merge logic has been the cause of a great deal of complexity in
the past and required a great deal of careful manoeuvring of code to ensure
its continued and correct functionality.
It has also recently been involved in an issue surrounding maple tree
state, which again points to its problematic nature.
We make it much easier to reason about mmap() logic by eliminating this and
simply writing a VMA once. This also opens the doors to future optimisation
and improvement in the mmap() logic.
For any device or file system which encounters unwanted VMA fragmentation
as a result of this change (that is, having not implemented .mmap_prepare
hooks), the issue is easily resolvable by doing so.
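For instance, a driver which today relies upon the retry merge after
setting VM_DONTDUMP in its .mmap() hook would need only something like the
following (a sketch, with an invented name):

        static int foo_mmap_prepare(struct vm_area_desc *desc)
        {
                desc->vm_flags |= VM_DONTDUMP;
                return 0;
        }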
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
mm/vma.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 3f32e04bb6cc..3ff6cfbe3338 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -24,7 +24,6 @@ struct mmap_state {
void *vm_private_data;
unsigned long charged;
- bool retry_merge;
struct vm_area_struct *prev;
struct vm_area_struct *next;
@@ -2417,8 +2416,6 @@ static int __mmap_new_file_vma(struct mmap_state *map,
!(map->flags & VM_MAYWRITE) &&
(vma->vm_flags & VM_MAYWRITE));
- /* If the flags change (and are mergeable), let's retry later. */
- map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
map->flags = vma->vm_flags;
return 0;
@@ -2622,17 +2619,6 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
if (have_mmap_prepare)
set_vma_user_defined_fields(vma, &map);
- /* If flags changed, we might be able to merge, so try again. */
- if (map.retry_merge) {
- struct vm_area_struct *merged;
- VMG_MMAP_STATE(vmg, &map, vma);
-
- vma_iter_config(map.vmi, map.addr, map.end);
- merged = vma_merge_existing_range(&vmg);
- if (merged)
- vma = merged;
- }
-
__mmap_complete(&map, vma);
return addr;
--
2.49.0
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
@ 2025-05-12 9:24 ` Christian Brauner
2025-05-12 11:31 ` Lorenzo Stoakes
2025-05-12 13:37 ` Vlastimil Babka
` (3 subsequent siblings)
4 siblings, 1 reply; 23+ messages in thread
From: Christian Brauner @ 2025-05-12 9:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered prior to mmap()'ing a range
> (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() callback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which make it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> return file->f_op->mmap(file, vma);
> }
>
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
nit: I would prefer if we could rename this to vfs_mmap() and
vfs_mmap_prepare() as this is in line with all the other vfs related
helpers we expose.
> +
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..15808cad2bc1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,30 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> /*
> * This struct describes a virtual memory area. There is one of these
> * per VM-area/task. A VM area is any part of the process virtual memory
> diff --git a/mm/memory.c b/mm/memory.c
> index 68c1d962d0ad..99af83434e7c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> dump_page(page, "bad pte");
> pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
> + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> vma->vm_file,
> vma->vm_ops ? vma->vm_ops->fault : NULL,
> vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
> + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
> mapping ? mapping->a_ops->read_folio : NULL);
> dump_stack();
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 81dd962a1cfc..50f902c08341 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> vm_flags &= ~VM_MAYEXEC;
> }
>
> - if (!file->f_op->mmap)
> + if (!file_has_valid_mmap_hooks(file))
> return -ENODEV;
> if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> return -EINVAL;
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..3f32e04bb6cc 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -17,6 +17,11 @@ struct mmap_state {
> unsigned long pglen;
> unsigned long flags;
> struct file *file;
> + pgprot_t page_prot;
> +
> + /* User-defined fields, perhaps updated by .mmap_prepare(). */
> + const struct vm_operations_struct *vm_ops;
> + void *vm_private_data;
>
> unsigned long charged;
> bool retry_merge;
> @@ -40,6 +45,7 @@ struct mmap_state {
> .pglen = PHYS_PFN(len_), \
> .flags = flags_, \
> .file = file_, \
> + .page_prot = vm_get_page_prot(flags_), \
> }
>
> #define VMG_MMAP_STATE(name, map_, vma_) \
> @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> int error;
>
> vma->vm_file = get_file(map->file);
> +
> + if (!map->file->f_op->mmap)
> + return 0;
> +
> error = mmap_file(vma->vm_file, vma);
> if (error) {
> fput(vma->vm_file);
> @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vm_flags_init(vma, map->flags);
> - vma->vm_page_prot = vm_get_page_prot(map->flags);
> + vma->vm_page_prot = map->page_prot;
>
> if (vma_iter_prealloc(vmi, vma)) {
> error = -ENOMEM;
> @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> vma_set_page_prot(vma);
> }
>
> +/*
> + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
> + * specifies it.
> + *
> + * This is called prior to any merge attempt, and updates whitelisted fields
> + * that are permitted to be updated by the caller.
> + *
> + * All but user-defined fields will be pre-populated with original values.
> + *
> + * Returns 0 on success, or an error code otherwise.
> + */
> +static int call_mmap_prepare(struct mmap_state *map)
> +{
> + int err;
> + struct vm_area_desc desc = {
> + .mm = map->mm,
> + .start = map->addr,
> + .end = map->end,
> +
> + .pgoff = map->pgoff,
> + .file = map->file,
> + .vm_flags = map->flags,
> + .page_prot = map->page_prot,
> + };
> +
> + /* Invoke the hook. */
> + err = __call_mmap_prepare(map->file, &desc);
> + if (err)
> + return err;
> +
> + /* Update fields permitted to be changed. */
> + map->pgoff = desc.pgoff;
> + map->file = desc.file;
> + map->flags = desc.vm_flags;
> + map->page_prot = desc.page_prot;
> + /* User-defined fields. */
> + map->vm_ops = desc.vm_ops;
> + map->vm_private_data = desc.private_data;
> +
> + return 0;
> +}
> +
> +static void set_vma_user_defined_fields(struct vm_area_struct *vma,
> + struct mmap_state *map)
> +{
> + if (map->vm_ops)
> + vma->vm_ops = map->vm_ops;
> + vma->vm_private_data = map->vm_private_data;
> +}
> +
> static unsigned long __mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> int error;
> + bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> VMA_ITERATOR(vmi, mm, addr);
> MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
>
> error = __mmap_prepare(&map, uf);
> + if (!error && have_mmap_prepare)
> + error = call_mmap_prepare(&map);
> if (error)
> goto abort_munmap;
>
> @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> goto unacct_error;
> }
>
> + if (have_mmap_prepare)
> + set_vma_user_defined_fields(vma, &map);
> +
> /* If flags changed, we might be able to merge, so try again. */
> if (map.retry_merge) {
> struct vm_area_struct *merged;
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 198abe66de5a..f6e45e62da3a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -253,8 +253,40 @@ struct mm_struct {
> unsigned long flags; /* Must use atomic bitops to access */
> };
>
> +struct vm_area_struct;
> +
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> +struct file_operations {
> + int (*mmap)(struct file *, struct vm_area_struct *);
> + int (*mmap_prepare)(struct vm_area_desc *);
> +};
> +
> struct file {
> struct address_space *f_mapping;
> + const struct file_operations *f_op;
> };
>
> #define VMA_LOCK_OFFSET 0x40000000
> @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
> vma->__vm_flags &= ~flags;
> }
>
> -static inline int call_mmap(struct file *, struct vm_area_struct *)
> -{
> - return 0;
> -}
> -
> static inline int shmem_zero_setup(struct vm_area_struct *)
> {
> return 0;
> @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> (void)vma;
> }
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> +static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> + return file->f_op->mmap(file, vma);
> +}
> +
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.49.0
>
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-12 9:24 ` Christian Brauner
@ 2025-05-12 11:31 ` Lorenzo Stoakes
2025-05-13 7:29 ` Christian Brauner
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-12 11:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Mon, May 12, 2025 at 11:24:06AM +0200, Christian Brauner wrote:
> On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
[snip]
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 016b0fe1536e..e2721a1ff13d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
[snip]
> > static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > {
> > + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> > + return -EINVAL;
> > +
> > return file->f_op->mmap(file, vma);
> > }
> >
> > +static inline int __call_mmap_prepare(struct file *file,
> > + struct vm_area_desc *desc)
> > +{
> > + return file->f_op->mmap_prepare(desc);
> > +}
>
> nit: I would prefer if we could rename this to vfs_mmap() and
> vfs_mmap_prepare() as this is in line with all the other vfs related
> helpers we expose.
>
Happy to do it, but:
call_mmap() is already invoked in a bunch of places, so kinda falls outside this
series (+ would touch a bunch of unrelated files), would you mind if I sent that
separately?
Thanks!
[snip]
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
@ 2025-05-12 13:37 ` Vlastimil Babka
2025-05-13 9:01 ` David Hildenbrand
` (2 subsequent siblings)
4 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2025-05-12 13:37 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 5/9/25 14:13, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered prior to mmap()'ing a range
> (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() callback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which make it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook
2025-05-09 12:13 ` [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes
@ 2025-05-12 13:41 ` Vlastimil Babka
2025-05-13 13:23 ` Liam R. Howlett
1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2025-05-12 13:41 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 5/9/25 14:13, Lorenzo Stoakes wrote:
> Secretmem has a simple .mmap() hook which is easily converted to the new
> .mmap_prepare() callback.
>
> Importantly, it's a rare instance of a driver that manipulates a VMA which
> is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA
> flags which may affect mergeability, meaning the retry merge logic might
> impact whether or not the VMA is merged.
>
> By using .mmap_prepare() there's no longer any need to retry the merge
> later as we can simply set the correct flags from the start.
>
> This change therefore allows us to remove the retry merge logic in a
> subsequent commit.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH v2 3/3] mm/vma: remove mmap() retry merge
2025-05-09 12:13 ` [PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
@ 2025-05-12 13:44 ` Vlastimil Babka
2025-05-13 13:25 ` Liam R. Howlett
1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2025-05-12 13:44 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 5/9/25 14:13, Lorenzo Stoakes wrote:
> We have now introduced a mechanism that obviates the need for a reattempted
> merge via the mmap_prepare() file hook, so eliminate this functionality
> altogether.
>
> The retry merge logic has been the cause of a great deal of complexity in
> the past and required a great deal of careful manoeuvring of code to ensure
> its continued and correct functionality.
>
> It has also recently been involved in an issue surrounding maple tree
> state, which again points to its problematic nature.
>
> We make it much easier to reason about mmap() logic by eliminating this and
> simply writing a VMA once. This also opens the doors to future optimisation
> and improvement in the mmap() logic.
>
> For any device or file system which encounters unwanted VMA fragmentation
> as a result of this change (that is, having not implemented .mmap_prepare
> hooks), the issue is easily resolvable by doing so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-12 11:31 ` Lorenzo Stoakes
@ 2025-05-13 7:29 ` Christian Brauner
0 siblings, 0 replies; 23+ messages in thread
From: Christian Brauner @ 2025-05-13 7:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Jan Kara, Matthew Wilcox
On Mon, May 12, 2025 at 12:31:34PM +0100, Lorenzo Stoakes wrote:
> On Mon, May 12, 2025 at 11:24:06AM +0200, Christian Brauner wrote:
> > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
>
> [snip]
>
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 016b0fe1536e..e2721a1ff13d 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
>
> [snip]
>
> > > static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > > {
> > > + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> > > + return -EINVAL;
> > > +
> > > return file->f_op->mmap(file, vma);
> > > }
> > >
> > > +static inline int __call_mmap_prepare(struct file *file,
> > > + struct vm_area_desc *desc)
> > > +{
> > > + return file->f_op->mmap_prepare(desc);
> > > +}
> >
> > nit: I would prefer if we could rename this to vfs_mmap() and
> > vfs_mmap_prepare() as this is in line with all the other vfs related
> > helpers we expose.
> >
>
> Happy to do it, but:
>
> call_mmap() is already invoked in a bunch of places, so kinda falls outside this
> series (+ would touch a bunch of unrelated files), would you mind if I sent that
> separately?
Sure, that's fine.
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
2025-05-12 9:24 ` Christian Brauner
2025-05-12 13:37 ` Vlastimil Babka
@ 2025-05-13 9:01 ` David Hildenbrand
2025-05-13 9:32 ` Lorenzo Stoakes
2025-05-13 13:22 ` Liam R. Howlett
2025-05-14 9:04 ` Pedro Falcato
4 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-13 9:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 09.05.25 14:13, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered prior to mmap()'ing a range
> (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() callback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which make it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and being invoked very early in the
> mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
So, if neither is set, it's also an invalid setting, understood.
So we want XOR.
const bool has_mmap = file->f_op->mmap;
const bool has_mmap_prepare = file->f_op->mmap_prepare;
const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
WARN_ON_ONCE(!mutual_exclusive);
return mutual_exclusive;
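i.e. spelled out as a complete helper, the suggestion would read roughly
(a sketch, untested):

static inline bool file_has_valid_mmap_hooks(struct file *file)
{
        const bool has_mmap = file->f_op->mmap;
        const bool has_mmap_prepare = file->f_op->mmap_prepare;
        /* Exactly one of the two hooks must be provided. */
        const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;

        WARN_ON_ONCE(!mutual_exclusive);
        return mutual_exclusive;
}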
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 9:01 ` David Hildenbrand
@ 2025-05-13 9:32 ` Lorenzo Stoakes
2025-05-13 13:25 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-13 9:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
> On 09.05.25 14:13, Lorenzo Stoakes wrote:
> > Provide a means by which drivers can specify which fields of those
> > permitted to be changed should be altered prior to mmap()'ing a range
> > (which may either result from a merge or from mapping an entirely new
> > VMA).
> >
> > Doing so is substantially safer than the existing .mmap() callback which
> > provides unrestricted access to the part-constructed VMA and permits
> > drivers and file systems to do 'creative' things which make it hard to
> > reason about the state of the VMA after the function returns.
> >
> > The existing .mmap() callback's freedom has caused a great deal of issues,
> > especially in error handling, as unwinding the mmap() state has proven to
> > be non-trivial and caused significant issues in the past, for instance
> > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > error path behaviour").
> >
> > It also necessitates a second attempt at merge once the .mmap() callback
> > has completed, which has caused issues in the past, is awkward, adds
> > overhead and is difficult to reason about.
> >
> > The .mmap_prepare() callback eliminates this requirement, as we can update
> > fields prior to even attempting the first merge. It is safer, as we heavily
> > restrict what can actually be modified, and being invoked very early in the
> > mmap() process, error handling can be performed safely with very little
> > unwinding of state required.
> >
> > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > exclusive, so we permit only one to be invoked at a time.
> >
> > Update vma userland test stubs to account for changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > include/linux/fs.h | 25 ++++++++++++
> > include/linux/mm_types.h | 24 +++++++++++
> > mm/memory.c | 3 +-
> > mm/mmap.c | 2 +-
> > mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> > tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> > 6 files changed, 180 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 016b0fe1536e..e2721a1ff13d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2169,6 +2169,7 @@ struct file_operations {
> > int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> > int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> > unsigned int poll_flags);
> > + int (*mmap_prepare)(struct vm_area_desc *);
> > } __randomize_layout;
> > /* Supports async buffered reads */
> > @@ -2238,11 +2239,35 @@ struct inode_operations {
> > struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> > } ____cacheline_aligned;
> > +/* Did the driver provide valid mmap hook configuration? */
> > +static inline bool file_has_valid_mmap_hooks(struct file *file)
> > +{
> > + bool has_mmap = file->f_op->mmap;
> > + bool has_mmap_prepare = file->f_op->mmap_prepare;
> > +
> > + /* Hooks are mutually exclusive. */
> > + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> > + return false;
> > + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> > + return false;
> > +
> > + return true;
> > +}
>
> So, if neither is set, it's also an invalid setting, understood.
>
> So we want XOR.
>
>
>
> const bool has_mmap = file->f_op->mmap;
> const bool has_mmap_prepare = file->f_op->mmap_prepare;
> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
>
> WARN_ON_ONCE(!mutual_exclusive)
> return mutual_exclusive;
Yeah I did consider xor like this but I've always found it quite confusing
in this kind of context, honestly.
In a way I think it's a bit easier spelt out as it is now. But happy to
change if you feel strongly about it? :)
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks!
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
` (2 preceding siblings ...)
2025-05-13 9:01 ` David Hildenbrand
@ 2025-05-13 13:22 ` Liam R. Howlett
2025-05-14 9:04 ` Pedro Falcato
4 siblings, 0 replies; 23+ messages in thread
From: Liam R. Howlett @ 2025-05-13 13:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered prior to mmap()'ing a range
> (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() callback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which make it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and since it is invoked very early in
> the mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> include/linux/fs.h | 25 ++++++++++++
> include/linux/mm_types.h | 24 +++++++++++
> mm/memory.c | 3 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> 6 files changed, 180 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..e2721a1ff13d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2169,6 +2169,7 @@ struct file_operations {
> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> + int (*mmap_prepare)(struct vm_area_desc *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
> @@ -2238,11 +2239,35 @@ struct inode_operations {
> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> } ____cacheline_aligned;
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> {
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> return file->f_op->mmap(file, vma);
> }
>
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..15808cad2bc1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,30 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> /*
> * This struct describes a virtual memory area. There is one of these
> * per VM-area/task. A VM area is any part of the process virtual memory
> diff --git a/mm/memory.c b/mm/memory.c
> index 68c1d962d0ad..99af83434e7c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> dump_page(page, "bad pte");
> pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n",
> + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> vma->vm_file,
> vma->vm_ops ? vma->vm_ops->fault : NULL,
> vma->vm_file ? vma->vm_file->f_op->mmap : NULL,
> + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL,
> mapping ? mapping->a_ops->read_folio : NULL);
> dump_stack();
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 81dd962a1cfc..50f902c08341 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> vm_flags &= ~VM_MAYEXEC;
> }
>
> - if (!file->f_op->mmap)
> + if (!file_has_valid_mmap_hooks(file))
> return -ENODEV;
> if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
> return -EINVAL;
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..3f32e04bb6cc 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -17,6 +17,11 @@ struct mmap_state {
> unsigned long pglen;
> unsigned long flags;
> struct file *file;
> + pgprot_t page_prot;
> +
> + /* User-defined fields, perhaps updated by .mmap_prepare(). */
> + const struct vm_operations_struct *vm_ops;
> + void *vm_private_data;
>
> unsigned long charged;
> bool retry_merge;
> @@ -40,6 +45,7 @@ struct mmap_state {
> .pglen = PHYS_PFN(len_), \
> .flags = flags_, \
> .file = file_, \
> + .page_prot = vm_get_page_prot(flags_), \
> }
>
> #define VMG_MMAP_STATE(name, map_, vma_) \
> @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> int error;
>
> vma->vm_file = get_file(map->file);
> +
> + if (!map->file->f_op->mmap)
> + return 0;
> +
> error = mmap_file(vma->vm_file, vma);
> if (error) {
> fput(vma->vm_file);
> @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> vma_iter_config(vmi, map->addr, map->end);
> vma_set_range(vma, map->addr, map->end, map->pgoff);
> vm_flags_init(vma, map->flags);
> - vma->vm_page_prot = vm_get_page_prot(map->flags);
> + vma->vm_page_prot = map->page_prot;
>
> if (vma_iter_prealloc(vmi, vma)) {
> error = -ENOMEM;
> @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> vma_set_page_prot(vma);
> }
>
> +/*
> + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that
> + * specifies it.
> + *
> + * This is called prior to any merge attempt, and updates whitelisted fields
> + * that are permitted to be updated by the caller.
> + *
> + * All but user-defined fields will be pre-populated with original values.
> + *
> + * Returns 0 on success, or an error code otherwise.
> + */
> +static int call_mmap_prepare(struct mmap_state *map)
> +{
> + int err;
> + struct vm_area_desc desc = {
> + .mm = map->mm,
> + .start = map->addr,
> + .end = map->end,
> +
> + .pgoff = map->pgoff,
> + .file = map->file,
> + .vm_flags = map->flags,
> + .page_prot = map->page_prot,
> + };
> +
> + /* Invoke the hook. */
> + err = __call_mmap_prepare(map->file, &desc);
> + if (err)
> + return err;
> +
> + /* Update fields permitted to be changed. */
> + map->pgoff = desc.pgoff;
> + map->file = desc.file;
> + map->flags = desc.vm_flags;
> + map->page_prot = desc.page_prot;
> + /* User-defined fields. */
> + map->vm_ops = desc.vm_ops;
> + map->vm_private_data = desc.private_data;
> +
> + return 0;
> +}
> +
> +static void set_vma_user_defined_fields(struct vm_area_struct *vma,
> + struct mmap_state *map)
> +{
> + if (map->vm_ops)
> + vma->vm_ops = map->vm_ops;
> + vma->vm_private_data = map->vm_private_data;
> +}
> +
> static unsigned long __mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> int error;
> + bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> VMA_ITERATOR(vmi, mm, addr);
> MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
>
> error = __mmap_prepare(&map, uf);
> + if (!error && have_mmap_prepare)
> + error = call_mmap_prepare(&map);
> if (error)
> goto abort_munmap;
>
> @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> goto unacct_error;
> }
>
> + if (have_mmap_prepare)
> + set_vma_user_defined_fields(vma, &map);
> +
> /* If flags changed, we might be able to merge, so try again. */
> if (map.retry_merge) {
> struct vm_area_struct *merged;
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 198abe66de5a..f6e45e62da3a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -253,8 +253,40 @@ struct mm_struct {
> unsigned long flags; /* Must use atomic bitops to access */
> };
>
> +struct vm_area_struct;
> +
> +/*
> + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> + * manipulate mutable fields which will cause those fields to be updated in the
> + * resultant VMA.
> + *
> + * Helper functions are not required for manipulating any field.
> + */
> +struct vm_area_desc {
> + /* Immutable state. */
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> +
> + /* Mutable fields. Populated with initial state. */
> + pgoff_t pgoff;
> + struct file *file;
> + vm_flags_t vm_flags;
> + pgprot_t page_prot;
> +
> + /* Write-only fields. */
> + const struct vm_operations_struct *vm_ops;
> + void *private_data;
> +};
> +
> +struct file_operations {
> + int (*mmap)(struct file *, struct vm_area_struct *);
> + int (*mmap_prepare)(struct vm_area_desc *);
> +};
> +
> struct file {
> struct address_space *f_mapping;
> + const struct file_operations *f_op;
> };
>
> #define VMA_LOCK_OFFSET 0x40000000
> @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
> vma->__vm_flags &= ~flags;
> }
>
> -static inline int call_mmap(struct file *, struct vm_area_struct *)
> -{
> - return 0;
> -}
> -
> static inline int shmem_zero_setup(struct vm_area_struct *)
> {
> return 0;
> @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> (void)vma;
> }
>
> +/* Did the driver provide valid mmap hook configuration? */
> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> +{
> + bool has_mmap = file->f_op->mmap;
> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> +
> + /* Hooks are mutually exclusive. */
> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> + return false;
> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> + return false;
> +
> + return true;
> +}
> +
> +static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if (WARN_ON_ONCE(file->f_op->mmap_prepare))
> + return -EINVAL;
> +
> + return file->f_op->mmap(file, vma);
> +}
> +
> +static inline int __call_mmap_prepare(struct file *file,
> + struct vm_area_desc *desc)
> +{
> + return file->f_op->mmap_prepare(desc);
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.49.0
>
* Re: [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook
2025-05-09 12:13 ` [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes
2025-05-12 13:41 ` Vlastimil Babka
@ 2025-05-13 13:23 ` Liam R. Howlett
2025-05-13 15:32 ` Suren Baghdasaryan
1 sibling, 1 reply; 23+ messages in thread
From: Liam R. Howlett @ 2025-05-13 13:23 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> Secretmem has a simple .mmap() hook which is easily converted to the new
> .mmap_prepare() callback.
>
> Importantly, it's a rare instance of a driver that manipulates a VMA which
> is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA
> flags which may affect mergeability, meaning the retry merge logic might
> impact whether or not the VMA is merged.
>
> By using .mmap_prepare() there's no longer any need to retry the merge
> later as we can simply set the correct flags from the start.
>
> This change therefore allows us to remove the retry merge logic in a
> subsequent commit.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/secretmem.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 1b0a214ee558..589b26c2d553 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> -static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
> +static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> {
> - unsigned long len = vma->vm_end - vma->vm_start;
> + const unsigned long len = desc->end - desc->start;
>
> - if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> + if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> return -EINVAL;
>
> - if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
> + if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len))
> return -EAGAIN;
>
> - vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
> - vma->vm_ops = &secretmem_vm_ops;
> + desc->vm_flags |= VM_LOCKED | VM_DONTDUMP;
> + desc->vm_ops = &secretmem_vm_ops;
>
> return 0;
> }
> @@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
>
> static const struct file_operations secretmem_fops = {
> .release = secretmem_release,
> - .mmap = secretmem_mmap,
> + .mmap_prepare = secretmem_mmap_prepare,
> };
>
> static int secretmem_migrate_folio(struct address_space *mapping,
> --
> 2.49.0
>
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 9:32 ` Lorenzo Stoakes
@ 2025-05-13 13:25 ` David Hildenbrand
2025-05-13 15:31 ` Suren Baghdasaryan
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-13 13:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On 13.05.25 11:32, Lorenzo Stoakes wrote:
> On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
>> On 09.05.25 14:13, Lorenzo Stoakes wrote:
>>> Provide a means by which drivers can specify which fields of those
>>> permitted to be changed should be altered prior to mmap()'ing a
>>> range (which may either result from a merge or from mapping an entirely new
>>> VMA).
>>>
>>> Doing so is substantially safer than the existing .mmap() callback which
>>> provides unrestricted access to the part-constructed VMA and permits
>>> drivers and file systems to do 'creative' things which make it hard to
>>> reason about the state of the VMA after the function returns.
>>>
>>> The existing .mmap() callback's freedom has caused a great deal of issues,
>>> especially in error handling, as unwinding the mmap() state has proven to
>>> be non-trivial and caused significant issues in the past, for instance
>>> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
>>> error path behaviour").
>>>
>>> It also necessitates a second attempt at merge once the .mmap() callback
>>> has completed, which has caused issues in the past, is awkward, adds
>>> overhead and is difficult to reason about.
>>>
>>> The .mmap_prepare() callback eliminates this requirement, as we can update
>>> fields prior to even attempting the first merge. It is safer, as we heavily
>>> restrict what can actually be modified, and since it is invoked very early in
>>> the mmap() process, error handling can be performed safely with very little
>>> unwinding of state required.
>>>
>>> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
>>> exclusive, so we permit only one to be invoked at a time.
>>>
>>> Update vma userland test stubs to account for changes.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>> include/linux/fs.h | 25 ++++++++++++
>>> include/linux/mm_types.h | 24 +++++++++++
>>> mm/memory.c | 3 +-
>>> mm/mmap.c | 2 +-
>>> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
>>> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
>>> 6 files changed, 180 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 016b0fe1536e..e2721a1ff13d 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -2169,6 +2169,7 @@ struct file_operations {
>>> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
>>> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>>> unsigned int poll_flags);
>>> + int (*mmap_prepare)(struct vm_area_desc *);
>>> } __randomize_layout;
>>> /* Supports async buffered reads */
>>> @@ -2238,11 +2239,35 @@ struct inode_operations {
>>> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
>>> } ____cacheline_aligned;
>>> +/* Did the driver provide valid mmap hook configuration? */
>>> +static inline bool file_has_valid_mmap_hooks(struct file *file)
>>> +{
>>> + bool has_mmap = file->f_op->mmap;
>>> + bool has_mmap_prepare = file->f_op->mmap_prepare;
>>> +
>>> + /* Hooks are mutually exclusive. */
>>> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
>>> + return false;
>>> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
>>> + return false;
>>> +
>>> + return true;
>>> +}
>>
>> So, if neither is set, it's also an invalid setting, understood.
>>
>> So we want XOR.
>>
>>
>>
>> const bool has_mmap = file->f_op->mmap;
>> const bool has_mmap_prepare = file->f_op->mmap_prepare;
>> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
>>
>> WARN_ON_ONCE(!mutual_exclusive);
>> return mutual_exclusive;
>
> Yeah I did consider xor like this but I've always found it quite confusing
> in this kind of context, honestly.
With the local variable I think it's quite helpful (no need for a
comment :P ).
>
> In a way I think it's a bit easier spelt out as it is now. But happy to
> change if you feel strongly about it? :)
Certainly not strongly! :)
--
Cheers,
David / dhildenb
* Re: [PATCH v2 3/3] mm/vma: remove mmap() retry merge
2025-05-09 12:13 ` [PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
2025-05-12 13:44 ` Vlastimil Babka
@ 2025-05-13 13:25 ` Liam R. Howlett
2025-05-13 15:38 ` Suren Baghdasaryan
1 sibling, 1 reply; 23+ messages in thread
From: Liam R. Howlett @ 2025-05-13 13:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> We have now introduced, via the .mmap_prepare() file hook, a mechanism that
> obviates the need for a reattempted merge, so eliminate this functionality
> altogether.
>
> The retry merge logic has been the cause of a great deal of complexity in
> the past and required a great deal of careful manoeuvring of code to ensure
> its continued and correct functionality.
>
> It has also recently been involved in an issue surrounding maple tree
> state, which again points to its problematic nature.
>
> We make it much easier to reason about mmap() logic by eliminating this and
> simply writing a VMA once. This also opens the doors to future optimisation
> and improvement in the mmap() logic.
>
> For any device or file system which encounters unwanted VMA fragmentation
> as a result of this change (that is, having not implemented .mmap_prepare
> hooks), the issue is easily resolvable by doing so.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
I have a few tests for the vma test suite that would test this path.
I'll just let them go.
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/vma.c | 14 --------------
> 1 file changed, 14 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 3f32e04bb6cc..3ff6cfbe3338 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -24,7 +24,6 @@ struct mmap_state {
> void *vm_private_data;
>
> unsigned long charged;
> - bool retry_merge;
>
> struct vm_area_struct *prev;
> struct vm_area_struct *next;
> @@ -2417,8 +2416,6 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> !(map->flags & VM_MAYWRITE) &&
> (vma->vm_flags & VM_MAYWRITE));
>
> - /* If the flags change (and are mergeable), let's retry later. */
> - map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
> map->flags = vma->vm_flags;
>
> return 0;
> @@ -2622,17 +2619,6 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> if (have_mmap_prepare)
> set_vma_user_defined_fields(vma, &map);
>
> - /* If flags changed, we might be able to merge, so try again. */
> - if (map.retry_merge) {
> - struct vm_area_struct *merged;
> - VMG_MMAP_STATE(vmg, &map, vma);
> -
> - vma_iter_config(map.vmi, map.addr, map.end);
> - merged = vma_merge_existing_range(&vmg);
> - if (merged)
> - vma = merged;
> - }
> -
> __mmap_complete(&map, vma);
>
> return addr;
> --
> 2.49.0
>
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-13 13:25 ` David Hildenbrand
@ 2025-05-13 15:31 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 15:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Jann Horn,
Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 6:25 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.05.25 11:32, Lorenzo Stoakes wrote:
> > On Tue, May 13, 2025 at 11:01:41AM +0200, David Hildenbrand wrote:
> >> On 09.05.25 14:13, Lorenzo Stoakes wrote:
> >>> Provide a means by which drivers can specify which fields of those
> >>> permitted to be changed should be altered prior to mmap()'ing a
> >>> range (which may either result from a merge or from mapping an entirely new
> >>> VMA).
> >>>
> >>> Doing so is substantially safer than the existing .mmap() callback which
> >>> provides unrestricted access to the part-constructed VMA and permits
> >>> drivers and file systems to do 'creative' things which make it hard to
> >>> reason about the state of the VMA after the function returns.
> >>>
> >>> The existing .mmap() callback's freedom has caused a great deal of issues,
> >>> especially in error handling, as unwinding the mmap() state has proven to
> >>> be non-trivial and caused significant issues in the past, for instance
> >>> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> >>> error path behaviour").
> >>>
> >>> It also necessitates a second attempt at merge once the .mmap() callback
> >>> has completed, which has caused issues in the past, is awkward, adds
> >>> overhead and is difficult to reason about.
> >>>
> >>> The .mmap_prepare() callback eliminates this requirement, as we can update
> >>> fields prior to even attempting the first merge. It is safer, as we heavily
> >>> restrict what can actually be modified, and since it is invoked very early in
> >>> the mmap() process, error handling can be performed safely with very little
> >>> unwinding of state required.
> >>>
> >>> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> >>> exclusive, so we permit only one to be invoked at a time.
> >>>
> >>> Update vma userland test stubs to account for changes.
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> >>> ---
> >>> include/linux/fs.h | 25 ++++++++++++
> >>> include/linux/mm_types.h | 24 +++++++++++
> >>> mm/memory.c | 3 +-
> >>> mm/mmap.c | 2 +-
> >>> mm/vma.c | 68 +++++++++++++++++++++++++++++++-
> >>> tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++++---
> >>> 6 files changed, 180 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>> index 016b0fe1536e..e2721a1ff13d 100644
> >>> --- a/include/linux/fs.h
> >>> +++ b/include/linux/fs.h
> >>> @@ -2169,6 +2169,7 @@ struct file_operations {
> >>> int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> >>> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> >>> unsigned int poll_flags);
> >>> + int (*mmap_prepare)(struct vm_area_desc *);
> >>> } __randomize_layout;
> >>> /* Supports async buffered reads */
> >>> @@ -2238,11 +2239,35 @@ struct inode_operations {
> >>> struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
> >>> } ____cacheline_aligned;
> >>> +/* Did the driver provide valid mmap hook configuration? */
> >>> +static inline bool file_has_valid_mmap_hooks(struct file *file)
> >>> +{
> >>> + bool has_mmap = file->f_op->mmap;
> >>> + bool has_mmap_prepare = file->f_op->mmap_prepare;
> >>> +
> >>> + /* Hooks are mutually exclusive. */
> >>> + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare))
> >>> + return false;
> >>> + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare))
> >>> + return false;
> >>> +
> >>> + return true;
> >>> +}
> >>
> >> So, if neither is set, it's also an invalid setting, understood.
> >>
> >> So we want XOR.
> >>
> >>
> >>
> >> const bool has_mmap = file->f_op->mmap;
> >> const bool has_mmap_prepare = file->f_op->mmap_prepare;
> >> const bool mutual_exclusive = has_mmap ^ has_mmap_prepare;
> >>
> >> WARN_ON_ONCE(!mutual_exclusive);
> >> return mutual_exclusive;
> >
> > Yeah I did consider xor like this but I've always found it quite confusing
> > in this kind of context, honestly.
>
> With the local variable I think it's quite helpful (no need for a
> comment :P ).
>
> >
> > In a way I think it's a bit easier spelt out as it is now. But happy to
> > change if you feel strongly about it? :)
>
> Certainly not strongly! :)
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook
2025-05-13 13:23 ` Liam R. Howlett
@ 2025-05-13 15:32 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 15:32 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton,
David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 6:23 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> > Secretmem has a simple .mmap() hook which is easily converted to the new
> > .mmap_prepare() callback.
> >
> > Importantly, it's a rare instance of a driver that manipulates a VMA which
> > is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA
> > flags which may affect mergeability, meaning the retry merge logic might
> > impact whether or not the VMA is merged.
> >
> > By using .mmap_prepare() there's no longer any need to retry the merge
> > later as we can simply set the correct flags from the start.
> >
> > This change therefore allows us to remove the retry merge logic in a
> > subsequent commit.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> > ---
> > mm/secretmem.c | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > index 1b0a214ee558..589b26c2d553 100644
> > --- a/mm/secretmem.c
> > +++ b/mm/secretmem.c
> > @@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file)
> > return 0;
> > }
> >
> > -static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > {
> > - unsigned long len = vma->vm_end - vma->vm_start;
> > + const unsigned long len = desc->end - desc->start;
> >
> > - if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > + if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > return -EINVAL;
> >
> > - if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
> > + if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len))
> > return -EAGAIN;
> >
> > - vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
> > - vma->vm_ops = &secretmem_vm_ops;
> > + desc->vm_flags |= VM_LOCKED | VM_DONTDUMP;
> > + desc->vm_ops = &secretmem_vm_ops;
> >
> > return 0;
> > }
> > @@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
> >
> > static const struct file_operations secretmem_fops = {
> > .release = secretmem_release,
> > - .mmap = secretmem_mmap,
> > + .mmap_prepare = secretmem_mmap_prepare,
> > };
> >
> > static int secretmem_migrate_folio(struct address_space *mapping,
> > --
> > 2.49.0
> >
* Re: [PATCH v2 3/3] mm/vma: remove mmap() retry merge
2025-05-13 13:25 ` Liam R. Howlett
@ 2025-05-13 15:38 ` Suren Baghdasaryan
0 siblings, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 15:38 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton,
David Hildenbrand, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Tue, May 13, 2025 at 6:25 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250509 08:14]:
> > We have now introduced, via the .mmap_prepare() file hook, a mechanism that
> > obviates the need for a reattempted merge, so eliminate this functionality
> > altogether.
> >
> > The retry merge logic has been the cause of a great deal of complexity in
> > the past and required a great deal of careful manoeuvring of code to ensure
> > its continued and correct functionality.
> >
> > It has also recently been involved in an issue surrounding maple tree
> > state, which again points to its problematic nature.
> >
> > We make it much easier to reason about mmap() logic by eliminating this and
> > simply writing a VMA once. This also opens the doors to future optimisation
> > and improvement in the mmap() logic.
> >
> > For any device or file system which encounters unwanted VMA fragmentation
> > as a result of this change (that is, having not implemented .mmap_prepare
> > hooks), the issue is easily resolvable by doing so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
>
> I have a few tests for the vma test suite that would test this path.
> I'll just let them go.
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> > ---
> > mm/vma.c | 14 --------------
> > 1 file changed, 14 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 3f32e04bb6cc..3ff6cfbe3338 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -24,7 +24,6 @@ struct mmap_state {
> > void *vm_private_data;
> >
> > unsigned long charged;
> > - bool retry_merge;
> >
> > struct vm_area_struct *prev;
> > struct vm_area_struct *next;
> > @@ -2417,8 +2416,6 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> > !(map->flags & VM_MAYWRITE) &&
> > (vma->vm_flags & VM_MAYWRITE));
> >
> > - /* If the flags change (and are mergeable), let's retry later. */
> > - map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
Could we have a WARN_ON() here please? I know you took care of in-tree
drivers but if an out-of-tree driver does this there will be no
indication that it has to change its ways. I know we don't care about
out-of-tree ones but they still exist, so maybe we can be nice here
and issue a warning?
> > map->flags = vma->vm_flags;
> >
> > return 0;
> > @@ -2622,17 +2619,6 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > if (have_mmap_prepare)
> > set_vma_user_defined_fields(vma, &map);
> >
> > - /* If flags changed, we might be able to merge, so try again. */
> > - if (map.retry_merge) {
> > - struct vm_area_struct *merged;
> > - VMG_MMAP_STATE(vmg, &map, vma);
> > -
> > - vma_iter_config(map.vmi, map.addr, map.end);
> > - merged = vma_merge_existing_range(&vmg);
> > - if (merged)
> > - vma = merged;
> > - }
> > -
> > __mmap_complete(&map, vma);
> >
> > return addr;
> > --
> > 2.49.0
> >
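A minimal sketch of the warning Suren suggests above, sat in
__mmap_new_file_vma() where the removed retry_merge assignment was and reusing
its condition - hypothetical only, not part of the series as posted:

	/* An .mmap() hook changing mergeable flags should move to .mmap_prepare(). */
	WARN_ON_ONCE(vma->vm_flags != map->flags &&
		     !(vma->vm_flags & VM_SPECIAL));
	map->flags = vma->vm_flags;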
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-09 12:13 ` [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes
` (3 preceding siblings ...)
2025-05-13 13:22 ` Liam R. Howlett
@ 2025-05-14 9:04 ` Pedro Falcato
2025-05-14 9:12 ` Lorenzo Stoakes
4 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2025-05-14 9:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> Provide a means by which drivers can specify which fields of those
> permitted to be changed should be altered prior to mmap()'ing a
> range (which may either result from a merge or from mapping an entirely new
> VMA).
>
> Doing so is substantially safer than the existing .mmap() callback which
> provides unrestricted access to the part-constructed VMA and permits
> drivers and file systems to do 'creative' things which make it hard to
> reason about the state of the VMA after the function returns.
>
> The existing .mmap() callback's freedom has caused a great deal of issues,
> especially in error handling, as unwinding the mmap() state has proven to
> be non-trivial and caused significant issues in the past, for instance
> those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> error path behaviour").
>
> It also necessitates a second attempt at merge once the .mmap() callback
> has completed, which has caused issues in the past, is awkward, adds
> overhead and is difficult to reason about.
>
> The .mmap_prepare() callback eliminates this requirement, as we can update
> fields prior to even attempting the first merge. It is safer, as we heavily
> restrict what can actually be modified, and since it is invoked very early in
> the mmap() process, error handling can be performed safely with very little
> unwinding of state required.
>
> The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> exclusive, so we permit only one to be invoked at a time.
>
> Update vma userland test stubs to account for changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Neat idea, thanks. This should also help out with the insane proliferation of
vm_flags_set in ->mmap() callbacks all over. Hopefully.
However, could we:
1) Add a small writeup to Documentation/filesystems/vfs.rst for this callback
2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
So things like remap_pfn_range can still be used by drivers' mmap()
implementation.
1) is particularly important so our VFS and driver friends know this is supposed
to be The Way Forward.
--
Pedro
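To illustrate the vm_flags_set() point, a hypothetical conversion following the
same pattern as the secretmem change in patch 2 - the foo_* names are invented
for the example:

static const struct vm_operations_struct foo_vm_ops = {};

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* The deprecated hook pokes at the part-constructed VMA directly. */
	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &foo_vm_ops;
	return 0;
}

becomes:

static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	/* The new hook only touches whitelisted fields of the descriptor. */
	desc->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
	desc->vm_ops = &foo_vm_ops;
	return 0;
}

static const struct file_operations foo_fops = {
	.mmap_prepare = foo_mmap_prepare,
};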
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 9:04 ` Pedro Falcato
@ 2025-05-14 9:12 ` Lorenzo Stoakes
2025-05-14 10:01 ` Pedro Falcato
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-14 9:12 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > Provide a means by which drivers can specify which fields of those
> > permitted to be changed should be altered prior to mmap()'ing a
> > range (which may either result from a merge or from mapping an entirely new
> > VMA).
> >
> > Doing so is substantially safer than the existing .mmap() callback which
> > provides unrestricted access to the part-constructed VMA and permits
> > drivers and file systems to do 'creative' things which make it hard to
> > reason about the state of the VMA after the function returns.
> >
> > The existing .mmap() callback's freedom has caused a great deal of issues,
> > especially in error handling, as unwinding the mmap() state has proven to
> > be non-trivial and caused significant issues in the past, for instance
> > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > error path behaviour").
> >
> > It also necessitates a second attempt at merge once the .mmap() callback
> > has completed, which has caused issues in the past, is awkward, adds
> > overhead and is difficult to reason about.
> >
> > The .mmap_prepare() callback eliminates this requirement, as we can update
> > fields prior to even attempting the first merge. It is safer, as we heavily
> > restrict what can actually be modified, and since it is invoked very early in
> > the mmap() process, error handling can be performed safely with very little
> > unwinding of state required.
> >
> > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > exclusive, so we permit only one to be invoked at a time.
> >
> > Update vma userland test stubs to account for changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
>
> Neat idea, thanks. This should also help out with the insane proliferation of
> vm_flags_set in ->mmap() callbacks all over. Hopefully.
Thanks! :)
>
> However, could we:
>
> 1) Add a small writeup to Documentation/filesystems/vfs.rst for this callback
I looked at this but the problem is the docs there are already hopelessly
out of date (see the kernel version referred to).
But I'll look into how best to do the docs going forward.
>
> 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> So things like remap_pfn_range can still be used by drivers' mmap()
> implementation.
Thanks for raising the remap_pfn_range() case! Yes this is definitely a
thing.
However this proposed callback would totally undermine the purpose of the
change - the idea is to never give a vma because if we do so we lose all of
the advantages here and may as well just leave the mmap in place for
this...
However I do think we'll need a new callback at some point (previously
discussed in thread).
We could perhaps provide the option to _explicitly_ remap for instance. I
would want it to be heavily locked down as to what can happen and to happen
as early as possible.
This is something we can iterate on, as trying to find the ideal scheme
immediately will just lead to inaction, the big advantage with the approach
here is we can be iterative.
We provide this, use it in a scenario which allows us to eliminate merge
retry, and can take it from there :)
So indeed, watch this space basically... I will be highly proactive on this
stuff moving forward.
>
> 1) is particularly important so our VFS and driver friends know this is supposed
> to be The Way Forward.
I think probably the answer is for me to fully update the document to be
bang up to date, right? But that would obviously work best as a follow up
patch :)
Have added to todo, so will follow up on this thanks!
>
> --
> Pedro
Cheers, Lorenzo
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 9:12 ` Lorenzo Stoakes
@ 2025-05-14 10:01 ` Pedro Falcato
2025-05-14 10:14 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2025-05-14 10:01 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote:
> On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > > Provide a means by which drivers can specify which fields of those
> > > permitted to be changed should be altered prior to mmap()'ing a
> > > range (which may either result from a merge or from mapping an entirely new
> > > VMA).
> > >
> > > Doing so is substantially safer than the existing .mmap() callback which
> > > provides unrestricted access to the part-constructed VMA and permits
> > > drivers and file systems to do 'creative' things which make it hard to
> > > reason about the state of the VMA after the function returns.
> > >
> > > The existing .mmap() callback's freedom has caused a great deal of issues,
> > > especially in error handling, as unwinding the mmap() state has proven to
> > > be non-trivial and caused significant issues in the past, for instance
> > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > error path behaviour").
> > >
> > > It also necessitates a second attempt at merge once the .mmap() callback
> > > has completed, which has caused issues in the past, is awkward, adds
> > > overhead and is difficult to reason about.
> > >
> > > The .mmap_prepare() callback eliminates this requirement, as we can update
> > > fields prior to even attempting the first merge. It is safer, as we heavily
> > > restrict what can actually be modified, and since it is invoked very early in
> > > the mmap() process, error handling can be performed safely with very little
> > > unwinding of state required.
> > >
> > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > > exclusive, so we permit only one to be invoked at a time.
> > >
> > > Update vma userland test stubs to account for changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> >
> > Neat idea, thanks. This should also help out with the insane proliferation of
> > vm_flags_set in ->mmap() callbacks all over. Hopefully.
[snip]
> >
> > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> > So things like remap_pfn_range can still be used by drivers' mmap()
> > implementation.
>
> Thanks for raising the remap_pfn_range() case! Yes this is definitely a
> thing.
>
> However this proposed callback would totally undermine the purpose of the
> change - the idea is to never give a vma because if we do so we lose all of
> the advantages here and may as well just leave the mmap in place for
> this...
>
Yes, good point.
> However I do think we'll need a new callback at some point (previously
> discussed in thread).
>
> We could perhaps provide the option to _explicitly_ remap for instance. I
> would want it to be heavily locked down as to what can happen and to happen
> as early as possible.
>
I think we can simply combine various ideas here. Like:
struct vm_area_desc_private {
	struct vm_area_desc desc;
	struct vm_area_struct *vma;
};

Then, for this "mmap_finish" callback and associated infra:

int (*mmap_finish)(struct vm_area_desc *desc);

int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/)
{
	struct vm_area_desc_private *private =
		container_of(desc, struct vm_area_desc_private, desc);

	return remap_pfn_range(private->vma, /*...*/);
}

int random_driver_mmap_finish(struct vm_area_desc *desc)
{
	return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size,
				    desc->page_prot);
}
I think something of the sort would be rather less prone to abuse, and we could
take the time to then even polish up the interface (e.g. maybe it would be nicer
if mmap_remap_pfn_range took a vma offset, and not a start address).
Anyway, just brainstorming. This idea came to mind, I think it's quite interesting.
> This is something we can iterate on, as trying to find the ideal scheme
> immediately will just lead to inaction, the big advantage with the approach
> here is we can be iterative.
>
> We provide this, use it in a scenario which allows us to eliminate merge
> retry, and can take it from there :)
Yep, totally.
>
> So indeed, watch this space basically... I will be highly proactive on this
> stuff moving forward.
>
> >
> > 1) is particularly important so our VFS and driver friends know this is supposed
> > to be The Way Forward.
>
> I think probably the answer is for me to fully update the document to be
> bang up to date, right? But that would obviously work best as a follow up
> patch :)
>
You love your big projects :p I had the impression the docs were more or less up to date?
The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area,
and now ->mmap_prepare. And the descriptions are ATM quite useless, just
"called by the mmap(2) system call".
--
Pedro
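Taking the offset idea at face value, the helper might look something like the
below - again a sketch only, layered on the hypothetical vm_area_desc_private
above:

static int mmap_remap_pfn_range_off(struct vm_area_desc *desc,
				    unsigned long off, unsigned long pfn,
				    unsigned long size)
{
	struct vm_area_desc_private *private =
		container_of(desc, struct vm_area_desc_private, desc);

	/* The offset is relative to the start of the new mapping. */
	return remap_pfn_range(private->vma, desc->start + off, pfn, size,
			       desc->page_prot);
}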
* Re: [PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback
2025-05-14 10:01 ` Pedro Falcato
@ 2025-05-14 10:14 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-05-14 10:14 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox
On Wed, May 14, 2025 at 11:01:49AM +0100, Pedro Falcato wrote:
> On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote:
> > On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote:
> > > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote:
> > > > Provide a means by which drivers can specify which fields of those
> > > > permitted to be changed should be altered prior to mmap()'ing a
> > > > range (which may either result from a merge or from mapping an entirely new
> > > > VMA).
> > > >
> > > > Doing so is substantially safer than the existing .mmap() callback which
> > > > provides unrestricted access to the part-constructed VMA and permits
> > > > drivers and file systems to do 'creative' things which make it hard to
> > > > reason about the state of the VMA after the function returns.
> > > >
> > > > The existing .mmap() callback's freedom has caused a great deal of issues,
> > > > especially in error handling, as unwinding the mmap() state has proven to
> > > > be non-trivial and caused significant issues in the past, for instance
> > > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
> > > > error path behaviour").
> > > >
> > > > It also necessitates a second attempt at merge once the .mmap() callback
> > > > has completed, which has caused issues in the past, is awkward, adds
> > > > overhead and is difficult to reason about.
> > > >
> > > > The .mmap_prepare() callback eliminates this requirement, as we can update
> > > > fields prior to even attempting the first merge. It is safer, as we heavily
> > > > restrict what can actually be modified, and since it is invoked very early in
> > > > the mmap() process, error handling can be performed safely with very little
> > > > unwinding of state required.
> > > >
> > > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually
> > > > exclusive, so we permit only one to be invoked at a time.
> > > >
> > > > Update vma userland test stubs to account for changes.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > >
> > > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > >
> > > Neat idea, thanks. This should also help out with the insane proliferation of
> > > vm_flags_set in ->mmap() callbacks all over. Hopefully.
> [snip]
> > >
> > > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point.
> > > So things like remap_pfn_range can still be used by drivers' mmap()
> > > implementation.
> >
> > Thanks for raising the remap_pfn_range() case! Yes this is definitely a
> > thing.
> >
> > However this proposed callback would totally undermine the purpose of the
> > change - the idea is to never give a vma because if we do so we lose all of
> > the advantages here and may as well just leave the mmap in place for
> > this...
> >
>
> Yes, good point.
>
> > However I do think we'll need a new callback at some point (previously
> > discussed in thread).
> >
> > We could perhaps provide the option to _explicitly_ remap for instance. I
> > would want it to be heavily locked down as to what can happen and to happen
> > as early as possible.
> >
>
> I think we can simply combine various ideas here. Like:
>
> struct vm_area_desc_private {
> 	struct vm_area_desc desc;
> 	struct vm_area_struct *vma;
> };
>
> Then, for this "mmap_finish" callback and associated infra:
>
> int (*mmap_finish)(struct vm_area_desc *desc);
>
> int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/)
> {
> 	struct vm_area_desc_private *private =
> 		container_of(desc, struct vm_area_desc_private, desc);
>
> 	return remap_pfn_range(private->vma, /*...*/);
> }
>
> int random_driver_mmap_finish(struct vm_area_desc *desc)
> {
> 	return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size,
> 				    desc->page_prot);
> }
>
>
Yeah that's really nice actually! I like that. Will note down for when we
come to this.
Obviously we need to know when the finish implies a remap, but we can have
a field for this set in the vm_area_desc to indicate it.
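Sketched crudely, with the field name invented for illustration:

struct vm_area_desc {
	/* ... fields as introduced in patch 1 ... */

	/* Hypothetical: set by .mmap_prepare() to request an explicit remap. */
	bool wants_remap;
};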
> I think something of the sort would be rather less prone to abuse, and we could
> take the time to then even polish up the interface (e.g. maybe it would be nicer
> if mmap_remap_pfn_range took a vma offset, and not a start address).
Right yeah, David definitely wanted some improvements in this area so this
aligns with other intentions/work.
>
> Anyway, just brainstorming. This idea came to mind, I think it's quite interesting.
Nice ideas :) yeah, good chance to undo some sins here.
>
> > This is something we can iterate on, as trying to find the ideal scheme
> > immediately will just lead to inaction, the big advantage with the approach
> > here is we can be iterative.
> >
> > We provide this, use it in a scenario which allows us to eliminate merge
> > retry, and can take it from there :)
>
> Yep, totally.
>
> >
> > So indeed, watch this space basically... I will be highly proactive on this
> > stuff moving forward.
> >
> > >
> > > 1) is particularly important so our VFS and driver friends know this is supposed
> > > to be The Way Forward.
> >
> > I think probably the answer is for me to fully update the document to be
> > bang up to date, right? But that would obviously work best as a follow up
> > patch :)
> >
>
> You love your big projects :p I had the impression the docs were more or less up to date?
> The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area,
> and now ->mmap_prepare. And the descriptions are ATM quite useless, just
> "called by the mmap(2) system call".
Much like leg day, sometimes painful things are worthwhile things :)
As for docs I saw there were some missing uring entries, and it explicitly
says 'as of kernel 4.18'.
So I feel that it really needs to be properly updated, and that ends up being
a little out of scope for the series.
Since this series only uses this callback in one, mm-specific, place, I
think it makes sense to attack this on a follow up, alongside updating file
system callbacks etc.
But it's definitely on the todo! :)
Cheers, Lorenzo
>
> --
> Pedro